Abstract Many works related to Twitter aim at characterizing
its users in some way: role on the service (spammers, bots,
organizations, etc.), nature of the user (socio-professional
category, age, etc.), topics of interest, and others. However,
for a given user classification problem, it is very difficult
to select a set of appropriate features, because the many
features described in the literature are very heterogeneous,
with name overlaps and collisions, and numerous very close
variants. In this article, we review a wide range of such
features. In order to present a clear state-of-the-art
description, we unify their names, definitions and
relationships, and we propose a new, neutral typology. We then
illustrate the interest of our review by applying a selection
of these features to the offline influence detection problem.
This task consists in identifying users who are influential in
real life, based on their Twitter account and related data. We
show that most features deemed efficient to predict online
influence, such as the numbers of retweets and followers, are
not relevant to this problem. However, we propose several
content-based approaches to label Twitter users as Influencers
or not. We also rank them according to a predicted influence
level. Our proposals are evaluated over the CLEF RepLab 2014
dataset, and outperform state-of-the-art methods.
1 Introduction

Social Networking Services started to appear on the World Wide
Web as early as the year 2000, with sites such as Friendster
and MySpace. Since then, they have multiplied and taken over
the Internet, with hundreds of different services used by more
than one billion people. Among them, Twitter is one of the most
popular. It is used to report live events, share viewpoints
regarding a variety of topics, monitor public opinion, track
e-reputation, etc. The service consequently drew the attention
of politicians, firms, celebrities and marketing specialists,
who now largely base their communication on Twitter, trying to
become as visible and influential as possible.

User Classification. Due to the popularity and widespread use
of Twitter, there are numerous reasons why one would want to
categorize its users: market segmentation and marketing target
identification, detection of opinion trends, quality of service
improvement (e.g. by blocking spammers), sociological studies,
and others. But because of the diversity of Twitter users and
of the amount of available data, there are many ways to do so.
For these reasons, many works have been dedicated to the
characterization of Twitter profiles.

A number of these studies aim at identifying users holding
certain roles inside the service itself. The detection of
spammers is very popular, due to the critical nature of this
task regarding the quality of service. Most works focus on the
identification of spambots, i.e. software agents working in an
automated way [8,31,45,46,75]. The detection of crowdturfers,
the human crowdsourced equivalent of spambots, constitutes a
related but less-known task [48]. The tool described in [16]
distinguishes regular human users, bots (robots, i.e. fully
automated users, which can be spammers, but not necessarily)
and so-called cyborgs (computer-assisted humans or
human-assisted bots). Certain authors study social capitalists,
a class of users taking advantage of specific strategies to
gain visibility on Twitter without producing any valuable
content. Some works focus on their identification [23,24,31],
others on the characterization of their position and role in
the network [22]. Some typologies are more detailed: for
instance, in [72], the authors distinguish 3 types of real
users (personal, professional, business) and 3 types of digital
actors (Spammers, Newsfeeds, Marketing services). In [47], the
authors propose a method to detect Retweeters, i.e. users more
likely to fetch tweets related to a given subject. Influence is
also a topic of interest, with numerous works aiming at
measuring it, or detecting influential users [4,18,77].

Other works categorize users relative to real-world aspects.
Many works focus on socio-professional categories: age
[1,59,61], gender [1,59,61], ethnicity/regional origin [55,61],
city [14,49], country [37,49], political orientation
[30,1,17,50,55,61], business domain [55]. In [64], the authors
distinguish two types of Twitter users (individual persons vs.
organizations), and three in [15] (organizations, journalists,
ordinary persons).

Certain works categorize users not relative to the whole
system, but to some user of interest. This is notably the case
in works aiming at recommending followees (i.e. which users to
follow) [5,32,29,41]. Some works aim at simultaneously
classifying users according to topics/categories not specified
in advance, and uncovering the most relevant topics/categories
themselves [38,57]. Finally, another category of works takes
advantage of user-related features to improve the
classification of tweets. For instance, several articles
describe methods to distinguish tweets depending on the
communication objective behind them. In [66], the authors
distinguish News, Opinions, Deals, Events and Private messages;
in [53] they use 9 categories such as Information sharing, Self
promotion, and Question to followers.

Twitter Features. The cited studies come from a variety of
fields: computer science, sociology, statistical physics,
political sciences, etc. They consequently have different
goals, and tackle the problem of user classification in
different ways, applying different methods to different data.
However, the adopted approaches can be commonly described in a
generic way: 1) identifying the appropriate features, i.e. the
relevant data describing the users; and 2) training a
classifier to discriminate the targeted user classes based on
these features. In this article, we focus on the first point,
i.e. the features one can extract from Twitter for the purpose
of user classification.


Over the years, and because user classification studies come
from such a variety of backgrounds, a number of such features
have been proposed for the purpose of user classification. Some
are specific to certain research domains. For instance, works
coming from Social Network Analysis (SNA) tend to focus on the
way users are interconnected, whereas studies from Natural
Language Processing (NLP) obviously focus on the textual
content of tweets. But many simple features, such as the number
of tweets published by a user, are widespread independently of
the research domain. The difficulty for a newcomer is that,
across those articles, these features may have different names
when they actually are the same; or vice versa (same name when
they actually are different); or one feature can come in a
number of more or less similar variants. Moreover, it is
difficult to determine which feature or variant is appropriate
for a given user classification problem: the features one would
use to detect spammers might not be relevant when trying to
identify the political orientation of users. For instance,
during the 3rd International Author Profiling Task at PAN 2015
[58], which focused on Age and Gender identification, the
organizers were not able to highlight a specific, particularly
relevant feature.

Contributions. In this article, we propose a review of the
features used to classify Twitter users. Of course, performing
an exhaustive survey seems hardly possible, due to the number
of concerned works. We however consider a wide range of studies
and adopt a high-level approach, focusing on the meaning of the
features while also describing the different forms they can
take. We organize them in a new, trans-domain typology. As an
illustration of how our review can be used, we then apply a
selection of these features to a real-world problem: the
detection of offline influential users. In other words, we aim
to detect people who are influential in real life, based on
their Twitter profile and activity. To do so, we conduct
experiments on the CLEF RepLab 2014 dataset, which was designed
specifically for this task. Indeed, it contains Twitter data
including Twitter profiles annotated in terms of offline
influence. We take advantage of these manual annotations to
train several Machine Learning (ML) tools and assess their
performance on classification and ranking issues. The former
consists in determining if a user is influential or
non-influential, whereas the latter aims at ranking users
depending on their estimated influence level.

Our first contribution is to review a large number of
Twitter-based features used for user profile characterization
problems, and to present them in a unified form, using a new
typology. Our second contribution is the assessment of these
generic features, relative to a specific problem consisting in
predicting offline influence. We show that most simple features
behave rather poorly, and discuss the questions raised by this
observation. Finally, we describe several NLP ranking methods
that give better results than known state-of-the-art
approaches.

Organization. The rest of this paper is organized as follows.
In the next section (Section 2), we review the features related
to the classification or characterization of Twitter users,
with an emphasis on their meaning in this context. We also
propose a typology for these features, in an effort to
highlight how they are connected. We then focus on
Twitter-based offline influence detection in Section 3. We
describe the problem at hand, the RepLab data we used in our
experiments, and the methods we propose to solve this problem.
In Section 4, we present the results we obtained and discuss
them. Finally, we highlight the main aspects of our work in the
conclusion, and give some perspectives regarding how it can be
extended.

2 Review of Twitter-Related Features 

We present a review of the most interesting features one can
use to characterize Twitter users. Due to the generally large
number of features used in a given study, authors often group
them thematically. However, there is no standard regarding the
resulting feature categories, which can vary widely from one
author to the other. In particular, people tend to categorize
features based on some criteria related to their field of study
(i.e. mainly SNA and NLP). Here, we try to ignore this somewhat
artificial distinction, and propose a neutral typology. We do
not aim to be exhaustive, but rather to include widely used
features, and to emphasize their diversity.

Before starting to describe the features in detail, we need to
introduce some concepts related to Twitter. This online
micro-blogging service allows users to publicly discuss largely
publicized as well as everyday-life events [38] by using
tweets, short messages of at most 140 characters. To be able to
see the tweets posted by other users, one has to subscribe to
these users. If user u subscribes to user v, then u is called a
follower of v, whereas v is a followee of u. Each user can
retweet other users’ tweets to share these tweets with his
followers, or mark his agreement [11]. Users can also
explicitly mention other users to draw their attention by
adding an expression of the form @UserName in their tweets. One
can reply to a user when he is mentioned. Another important
Twitter feature is the possibility to tag tweets with keywords
called hashtags, which are strings marked by a leading sharp
(#) character.


Table 1 presents the list of all the features we reviewed,
indicating for each one: its category, a short description of
the feature, one or several associated descriptors (i.e. values
representing the feature), and some bibliographic references
illustrating how the feature was used, when possible.
Sometimes, several descriptors are indicated for the same
feature, because it can be used in various ways. This is
particularly true for those which can be expressed as a value
for each tweet, for example the number of mentions in a tweet
(Feature 25). It is possible to treat them in an absolute way,
i.e. sum the values over the considered period (e.g. total
number of mentions) or keep only the extreme values (e.g.
minimal and maximal numbers of mentions). One can also use a
relative approach by processing a central statistic and a
dispersion statistic (e.g. average number of mentions by tweet,
and the corresponding standard deviation).
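
The absolute and relative descriptors described above can be
sketched as follows. This is only an illustration: the function
name and the sample values are hypothetical, and the use of the
population standard deviation (rather than the sample one) is an
assumption, since the reviewed works do not all state which one
they use.

```python
from statistics import mean, pstdev


def descriptors(values_per_tweet):
    """Summarize a per-tweet quantity (e.g. mentions per tweet) for one user."""
    if not values_per_tweet:
        raise ValueError("at least one tweet is required")
    n_tweets = len(values_per_tweet)
    return {
        # Absolute descriptors: total and extreme values.
        "Cnt": sum(values_per_tweet),
        "Min": min(values_per_tweet),
        "Max": max(values_per_tweet),
        # Relative descriptors: central and dispersion statistics.
        "Avg": mean(values_per_tweet),
        "Sd": pstdev(values_per_tweet),  # assumption: population Sd
        # Proportion of tweets in which the quantity is non-zero.
        "Prop": sum(1 for v in values_per_tweet if v > 0) / n_tweets,
    }


# Hypothetical user: number of mentions in each of 5 tweets.
mentions_per_tweet = [0, 2, 1, 0, 3]
summary = descriptors(mentions_per_tweet)
```

The same function applies to any feature of Table 1 that is
defined for each tweet (hashtags, URLs, media resources, etc.).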

2.1 Description of the Features 

This subsection describes all the features from Table 
1 in detail, considering each category separately. We 
discuss each feature and indicate how it is relevant, and 
in which context. 

2.1.1 User Profile 

Our first category gathers features related to user profiles.
The first 4 are Boolean values representing whether: the user
set up a profile picture (Feature 1), his account was
officially verified by Twitter (Feature 2), he allows other
users to contribute to his account (Feature 3), he set up his
personal Webpage (Feature 4). The profile picture itself is
also analyzed by certain authors, using image processing
methods, to extract information such as age, gender and race
[37].

Feature 5 is an integer corresponding to the length (in
characters) of the text the user wrote to describe himself.
These features are good indicators of how committed the user is
regarding Twitter and his online presence. Professional
bloggers and corporate accounts, in particular, generally fill
these profile fields, whereas spambots, or passive users (i.e.
only reading Twitter feeds but not producing any content) do
not. Verified accounts tend to be owned by humans, not bots
[16].

The content of the profile description can also be analyzed
(Feature 8) in order to extract valuable information. For
instance, in [55], Pennacchiotti & Popescu engineered a
collection of regular expressions in order to retrieve the age
and ethnicity of the users.
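
To give an idea of what such a rule-based extraction can look
like, here is a minimal sketch for the age only. The patterns
below are hypothetical and written for this illustration; they
are not the expressions engineered in [55], which are not
reproduced in this review.

```python
import re

# Hypothetical patterns, for illustration only (not those of [55]).
AGE_PATTERNS = [
    # "27 years old", "27 yrs old", "27yo", "27 y/o"
    re.compile(r"\b(\d{1,2})\s*(?:years?\s*old|yrs?\s*old|y/?o)\b", re.IGNORECASE),
    # "I'm 31", "I am 31"
    re.compile(r"\bI(?:'m|\s+am)\s+(\d{1,2})\b", re.IGNORECASE),
]


def extract_age(description):
    """Return the first age found in a profile description, or None."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(description)
        if match:
            return int(match.group(1))
    return None


age = extract_age("Coffee lover, 27 years old, Paris")
```

Such patterns are precise but cover few profiles, since most
users do not state their age explicitly.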

Table 1 Features used to characterize Twitter users. The Descriptor column indicates which statistics can be used to represent
the feature: Total count (Cnt), Average value (Avg), Standard deviation (Sd), Minimum value (Min), Maximum value (Max),
Overall proportion (Prop), Set cardinality (Cardinality). The number of examples is limited to 5.

Category             Description                                            Descriptors               Examples
User Profile         1. Profile picture                                     Boolean/Image             [55,73]
                     2. Verified account                                    Boolean                   [16,45,72,73]
                     3. Contributions allowed                               Boolean                   [73]
                     4. Personal Webpage set                                Boolean                   [48,73]
                     5. Number of characters in the profile description     Count                     [46,48]
                     6. Number of usernames in the profile description      Count                     [62]
                     7. Number of URLs in the profile description           Count                     [62]
                     8. Content of the profile description                  Text                      [55]
                     9. Number of (special) characters in the username      Count                     [46,48,55,62]
                     10. Age of the profile                                 Value                     [8,46,55,62,72]
                     11. Twitter client                                     Prop/Cnt/Boolean          [16,23,37]
Publishing Activity  12. Tweets published by the user                       Cnt/Avg/Sd/Min/Max        [16,46,62,61,73]
                     13. Media resources published by the user              Cnt/Prop/Avg/Sd/Min/Max   [62]
                     14. Delay between two consecutive tweets of the user   Avg/Sd/Min/Max            [8,55,62]
                     15. Self-mentions of the user                          Cnt/Prop/Avg/Sd/Min/Max   [62]
                     16. Geolocated tweets published by the user            Prop/Cnt/Boolean          [37,73]
Local Connections    17. Topology of the follower-followee network          Graph-related measures    [13,46,62,69,73]
                     18. Subscription lists containing the user             Count                     [21,73]
                     19. Ids of the user’s most recent followers/followees  Standard deviation        [46]
                     20. Tweets published by the followers/followees        Cnt/Avg/Sd/Min/Max        [8,62]
User Interaction     21. Retweets published by the user                     Cnt/Prop/Avg/Sd/Min/Max   [8,21,55,61,72]
                     22. Number of times the user is retweeted by others    Cnt/Prop/Avg/Sd/Min/Max   [4,8,13,62]
                     23. Favorites selected by the user                     Count                     [15,62,73]
                     24. Tweets of the user marked as favorite by others    Cnt/Prop/Avg/Sd/Min/Max   [21,62,72]
                     25. (Unique) mentions of other users                   Cnt/Prop/Avg/Sd/Min/Max   [16,46,62,64,72]
                     26. Mentions by other users                            Cnt/Avg/Sd/Min/Max        [8,13,72]
Lexical Aspects      27. Number of (unique) words                           Cnt/Avg/Sd/Min/Max        [8,62,78]
                     28. Number of hapaxes                                  Cnt/Prop/Avg/Sd/Min/Max   [62]
                     29. Named entities                                     Cnt/Prop/Avg/Sd/Min/Max   [15,64]
                     30. Word n-gram weighting                              Vector                    [17,18,64,73,78]
                     31. Prototypical n-grams                               Vector                    [1,14,48,50,55]
Stylistic Traits     32. Word length, in characters                         Avg/Sd/Min/Max            [62]
                     33. Tweet length                                       Avg/Sd/Min/Max            [8,50,62,64]
                     34. Readability of the user’s tweets                   Avg/Sd/Min/Max            [64,78]
                     35. Special characters or patterns                     Cnt/Prop/Avg/Sd/Min/Max   [8,62,61,64,78]
                     36. Number of (unique) hashtags                        Cnt/Prop/Avg/Sd/Min/Max   [8,46,55,64,72]
                     37. Number of (unique) URLs                            Cnt/Prop/Avg/Sd/Min/Max   [16,45,55,62,72]
                     38. Similarity between the user’s own tweets           Cnt/Avg/Sd/Min/Max        [46,48,75]
External Data        39. Number of Web search results for the user’s page   Count                     [18]
                     40. Klout score                                        Value                     [18,24]
                     41. Kred score                                         Value                     [24]

Features 6 and 7 are the number of usernames and URLs appearing
in the textual profile description. Indeed, certain users take
advantage of this text to indicate they have other accounts or
reference several Websites. This can concern users with several
professional roles they want to distinguish, as well as users
wanting to gain visibility through specific strategies.


On the same note, the length of the username (Feature 
9), expressed in characters, was used in some studies 
to identify certain types of users [46,62]. For instance, 
social capitalists tend to have very long names. Certain 
authors also focus on the number of special characters
(e.g. hearts, emoticons) in the username [55], which may 
be characteristic of certain social categories. The name 
itself can also be a source of information: in [37], Huang 
et al. use it to infer the ethnicity of the user. 

The age of the profile (Feature 10) is likely to be
related to how visible the user is on Twitter, since it
takes some time to reach an influential position. It can
also help identify bots: in their 2012 paper, Chu et
al. noticed that 95% of the bots were registered in 2009.

Finally, Feature 11 corresponds to the software clients the
user favors when accessing Twitter: official Web site, official
smartphone application, management dashboard tool, third party
applications (Vine, SoundCloud...), etc. One can consider each
client as a Boolean value representing whether the user
regularly takes advantage of this tool [16,24,37].
Alternatively, it is also possible to select the usage
frequency of the tool, expressed in terms of total number or
proportion of uses.

2.1.2 Publishing Activity 

The next category focuses on the way the user behaves regarding
the tweets he publishes. Feature 12 corresponds to the number
of tweets he posted over the considered period of time, so it
represents how active the user is globally. Users posting a
small number of tweets are potentially information seekers
[38]. Because this number is generally high, certain authors
prefer to consider the number of tweets published per day
[46,55]. The standard deviation, minimal or maximal number of
tweets published in a day give an idea of the regularity of the
user in terms of tweeting. Alternatively, it is also possible
to specifically detect periodic posting behaviors, as Chu et
al. did to identify bots (programs that tweet automatically)
[16].

Feature 13 is the number of media resources contained in these
tweets. One can alternatively consider the proportion of the
user’s tweets containing a media resource, or one of the
previously cited statistics for a given period of time (e.g. by
day). The fact a user posts a lot of pictures or videos could
be discriminant in certain situations. For instance, the
concerned user could be active in an image-related field such
as photography, or he could tweet professionally to advertise
for a company.

Feature 14 is the duration between two consecutive tweets. It
aims at representing how regularly the user tweets. Authors
generally focus on the average delay and the associated
standard deviation [55], but the minimum, maximum and median
are also used [8].
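
As an illustration, the descriptors of Feature 14 can be
computed from the publication dates of a user’s tweets as in the
following sketch. The function name and the sample timestamps
are hypothetical, and the delays are expressed in seconds by
assumption (the unit varies across the reviewed works).

```python
from datetime import datetime
from statistics import mean, median, pstdev


def delay_descriptors(timestamps):
    """Descriptors of the delay (in seconds) between consecutive tweets."""
    ordered = sorted(timestamps)
    delays = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if not delays:
        raise ValueError("at least two tweets are required")
    return {
        "Avg": mean(delays),
        "Sd": pstdev(delays),  # assumption: population standard deviation
        "Min": min(delays),
        "Max": max(delays),
        "Median": median(delays),
    }


# Hypothetical user who tweeted three times on the same day.
tweet_dates = [
    datetime(2014, 1, 1, 12, 0),
    datetime(2014, 1, 1, 12, 10),
    datetime(2014, 1, 1, 12, 40),
]
stats = delay_descriptors(tweet_dates)
```

A very small standard deviation relative to the average delay
points to a highly regular posting rhythm.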

Feature 15 is the number of mentions the user makes of himself.
This strategy is used by users who need several consecutive
tweets to express an idea, and want to force Twitter to group
them in its graphical interface [33]. One can alternatively
consider the proportion of the user’s tweets containing a
self-mention, or the average number of self-mentions by day (or
any other statistic listed in Table 1, like for the previous
features).

Finally, Feature 16 is the proportion of tweets published by
the user which are geolocated. In certain studies, the authors
define it instead as a Boolean feature, depending on whether or
not the geolocation option is enabled in the user’s profile
[73]. Others prefer to count the number of distinct locations
associated to the user [18,37]. Like Features 1-5, this feature
can help discriminate certain types of users aiming at
exhibiting a very complete and controlled image, or with a
specific behavior implying the publicization of their physical
location (e.g. to draw a crowd in a specific place). In [37],
the nature of the location is used to identify the user’s
nationality.

2.1.3 Local Connections 

The features from this category describe how the user is
connected to the rest of the Twitter network. Feature 17
corresponds to the network of follower-to-followee
relationships, which can be treated in many ways. Most authors
extract two distinct values to represent a user: the number of
followers (people who have subscribed to the user’s feed) and
the number of followees (people to whom the user has
subscribed). In other words, the incoming and outgoing degrees
of the node representing the user in the network, respectively.

Some authors alternatively consider the set obtained by taking
the intersection of the user’s followers and followees. For
instance, Dugué and Perez [23] used it to distinguish regular
users from so-called social capitalists. These particular users
take advantage of specific strategies allowing them to be
highly visible on Twitter, while producing absolutely no
content of interest. One of the consequences of this behavior
is a strong overlap between their followers and followees,
which can be identified through the mentioned intersection.
Furthermore, this descriptor was used by Golder et al. for
followee recommendation [32]. More generally, the friends and
followers sets are commonly used by recommender systems to
model the user interests [29,5]. Also note that a number of
combinations of these set-based values appear in the
literature. Such combinations are specifically treated in
Section 2.2, but the follower-to-followee ratio is worth
mentioning, since it is the most widespread
[1,8,48,61,75,29].
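
The set-based descriptors of Feature 17 mentioned above can be
sketched as follows. The function and key names are
hypothetical. The normalized overlap (a Jaccard index) is an
assumption added for illustration: the works cited above rely on
the intersection itself, and may normalize it differently.

```python
def local_connection_descriptors(followers, followees):
    """Set-based descriptors of a user's follower/followee neighborhood."""
    followers, followees = set(followers), set(followees)
    mutual = followers & followees  # users both following and followed
    union = followers | followees
    return {
        "followers": len(followers),  # incoming degree
        "followees": len(followees),  # outgoing degree
        "mutual": len(mutual),
        # Follower-to-followee ratio, undefined without followees.
        "ratio": len(followers) / len(followees) if followees else None,
        # Assumption: overlap normalized as a Jaccard index.
        "overlap": len(mutual) / len(union) if union else 0.0,
    }


# Hypothetical neighborhood, identified by usernames.
d = local_connection_descriptors({"a", "b", "c", "d"}, {"c", "d", "e"})
```

An overlap close to 1 means the user follows back almost
everyone who follows him, which is the pattern described above
for social capitalists.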

Some other authors prefer to use the network in a more global
way, instead of focusing only on the local topology. For
instance, Weng et al. [77] proposed a modification of the
PageRank algorithm which allows computing an influence score
for a given topic. Java et al. used the HITS centralities (hub
and authority measures) to detect users of interest, and
community detection to identify groups of users concerned by
the same topics [38]. However, these methods require obtaining
the whole network, which is generally hardly possible.

Subscription lists allow Twitter users to group their 
followees as they see fit, and to share these lists with 
others. Placing a user in such a list can consequently be 
considered as a stronger form of subscription. Certain 
authors thus use the number of lists to which a user 
belongs as such a feature (Feature 18). 

Like Feature 17, Feature 19 is dual, in the sense it can be
processed for followers and for followees. It is the standard
deviation of the ids of the people who recently subscribed to
the user’s feed, or of the people to which the user recently
subscribed. Spambot farms tend to create numerous fake accounts
and make them subscribe to each other, in order to artificially
increase their visibility. The fake accounts are created
rapidly, so the associated numerical ids tend to be
near-consecutive: this can be detected by Feature 19.
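
The following sketch illustrates why the standard deviation of
ids separates the two situations. All ids below are made up for
the example, and no threshold is implied: in practice, the value
would simply be passed to the classifier as a feature.

```python
from statistics import pstdev


def id_dispersion(recent_ids):
    """Standard deviation of the numerical ids of recent followers (or followees)."""
    if len(recent_ids) < 2:
        raise ValueError("at least two ids are required")
    return pstdev(recent_ids)


# Hypothetical ids: accounts created at very different times...
organic_followers = [1203, 58211977, 330045871, 912, 7700412]
# ...versus accounts created in a burst, hence near-consecutive ids.
farmed_followers = [880000101, 880000102, 880000104, 880000105, 880000107]

organic_sd = id_dispersion(organic_followers)
farmed_sd = id_dispersion(farmed_followers)
```

With these values, the dispersion of the second list is several
orders of magnitude smaller than that of the first one.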

Finally, Feature 20 is also dual: it is the number of tweets
published by the user’s followers, and by his followees. It
represents the level of publishing activity in the direct
neighborhood of the user of interest. Instead of a raw count,
one can alternatively average by neighbor, or use one of the
other statistics listed in Table 1. Like for Feature 12, it is
also possible to consider a time period, e.g. the average
number of tweets published by the user’s followers by day.

2.1.4 User Interaction

This category gathers features describing how the user and the
other people interact. Feature 21 is the proportion of retweets
among the tweets published by the user, i.e. the proportion of
other persons’ messages that the user relayed [15,8,61]. It is
also possible to consider the raw count of such retweets [72],
or to process a time-dependent statistic such as the average
number (or proportion) of retweets by day. Symmetrically,
Feature 22 is the number of times a tweet published by the user
was retweeted by others. Alternatively, one can also use the
proportion of the user’s tweets which were retweeted at least
once [4]. These features represent how much the user reacts to
external tweets, and how much reaction he gets from his own
tweets. Alternatively, certain authors worked with the retweet
network, i.e. a graph in which nodes represent users and are
connected when one user retweets another. In [17], Conover et
al. applied a community detection algorithm to this network, in
order to extract a categorical feature (the community to which
a user belongs).

Features 23 and 24 are related to the ability Twitter 
users have to mark certain tweets as their favorites. 
Feature 23 is the total number of favorites selected by 
the user, whereas Feature 24 is the number of times 
a tweet published by the user was marked as favorite 
by others. Considering an average value by day is not 
really relevant for the former, because the number of 
favorites is generally small. However, this (or another 
statistic) might be more appropriate for the latter, since 
the number is likely to be higher. Like the previous 
ones (Features 21 and 22), these features are related to 
the reactions caused by tweets. However, a retweet is a
much easier and more frequent operation, which gives more
importance to favorites.

The last two features deal with mentions, i.e. the fact of
explicitly naming a user in a tweet. Feature 25 is the number
of mentions the user puts in his tweets. Certain authors count
only unique mentions (i.e. they do not count the same mention
twice), whereas others consider all occurrences. This feature
allows identifying the propensity a user has to directly
converse with other users. Spambots are also known to fill
their tweets with many more mentions than human users [75].
Instead of counting the mentions, certain authors use their
length. Indeed, as we have seen for Feature 9, the length of a
username (mentions are based on usernames) can convey relevant
information. It is also possible to compute the proportion of
the user’s tweets which contain mentions of other users.
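
The variants of Feature 25 discussed above (all occurrences,
unique mentions, proportion of tweets with a mention, length of
the mentioned usernames) can be sketched as follows. The regular
expression is an assumption: it supposes that usernames contain
only letters, digits and underscores, with at most 15
characters, and that they are compared case-insensitively.

```python
import re

# Assumption: "@" not preceded by a word character (to skip e-mail
# addresses), followed by 1 to 15 letters, digits or underscores.
MENTION = re.compile(r"(?<!\w)@(\w{1,15})")


def mention_descriptors(tweets):
    """Mention-based descriptors computed over the tweets of one user."""
    per_tweet = [MENTION.findall(tweet) for tweet in tweets]
    mentions = [name.lower() for names in per_tweet for name in names]
    return {
        "Cnt": len(mentions),            # all occurrences
        "Unique": len(set(mentions)),    # each mentioned user counted once
        "Prop": sum(1 for names in per_tweet if names) / len(tweets) if tweets else 0.0,
        "AvgLength": sum(map(len, mentions)) / len(mentions) if mentions else 0.0,
    }


# Hypothetical tweets of a single user.
d = mention_descriptors([
    "@alice thanks!",
    "cc @Bob and @alice",
    "no mention here",
    "mail me at x@example.com",
])
```

Here the user makes 3 mentions of 2 distinct users, in half of
his tweets; the e-mail address is not counted as a mention.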

Feature 26 is symmetrical to Feature 25: it is the 
number of times the user is mentioned by others. It can 
be averaged (or any other statistic) for a given period 
of time (e.g. number of mentions by day). It can also 
be divided by the number of tweets published by the 
user, to get an average number of answers by user’s 
tweet (mentions generally express the will to answer 
another user’s tweet). This feature is interesting, but 
computationally hard to process, because for a given 
user, it basically requires parsing all tweets published 
by the other users. So, it is tractable only for small
datasets.

2.1.5 Lexical Aspects 

This category deals with the content produced by the user. A
number of features can be used to describe the lexical aspects
of the text composing his tweets. They are relevant when one
wants to discriminate users depending on the ideas they express
on Twitter, or how they express them. For instance, if a class
of users tend to tweet about the same topic, these features are
likely to allow their identification.

Feature 27 is related to the size of the user’s lexicon: it is
the number of words he uses. It is possible to count all
occurrences or to focus only on unique words. Alternatively,
one can also compute a statistic expressed by tweet (e.g.
average number of unique words by tweet), or over a period of
time (e.g. by day). Certain authors prefer to compare the size
of the user’s lexicon to that of the English dictionary, under
the form of a ratio [78]. Feature 28 is very similar, but for
hapaxes, i.e. words which are unique to the user [62]. Put
differently, this feature is about words only the considered
user includes in his tweets. Instead of counting them, one
could also consider the proportion of the user’s tweets
containing at least one hapax.
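
Features 27 and 28 can be illustrated by the sketch below, which
follows the definition given above (a hapax is a word used by
the considered user only). The tokenizer is a deliberately naive
assumption (lowercase alphanumeric tokens); real tweets would
require handling URLs, mentions, hashtags and emoticons. The
function names and the toy corpus are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Naive tokenizer (assumption): lowercase alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def lexical_descriptors(tweets_by_user, user):
    """Lexicon size (Feature 27) and hapaxes (Feature 28) for one user."""
    words = Counter(w for tweet in tweets_by_user[user] for w in tokenize(tweet))
    # Vocabulary of all the other users of the dataset.
    others = {
        w
        for other, tweets in tweets_by_user.items()
        if other != user
        for tweet in tweets
        for w in tokenize(tweet)
    }
    return {
        "words": sum(words.values()),        # all occurrences
        "unique_words": len(words),          # size of the lexicon
        "hapaxes": len(set(words) - others)  # words only this user employs
    }


# Hypothetical two-user dataset.
corpus = {
    "u1": ["good morning twitter", "good coffee"],
    "u2": ["morning run", "coffee time"],
}
d = lexical_descriptors(corpus, "u1")
```

Note that the hapax count depends on the whole dataset: adding
users can only decrease it.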

Feature 29 corresponds to the number of named entities
identified in the user’s tweets. Named entities correspond
roughly to proper names, which identify persons, organizations,
places, brands, etc. In [64], De Silva & Riloff use the average
number of occurrences by tweet, and treat each entity type
separately (persons, organizations and locations). In [15], De
Choudhury et al. just consider the absence/presence of entities
(i.e. a Boolean feature) in the users’ tweets.

Feature 30 consists in representing each user by a
numerical vector. It therefore differs from all the other
features, which take the form of scalar values (i.e. they
represent a user by a single value). This feature
consequently needs to be processed differently from the
others, as illustrated in Section 3.3 when treating
influence. Feature 30 comes directly from the Information
Retrieval field [65]. Each value in the vector
corresponds to (some function of) the frequency of a specific
n-gram. In our context, an n-gram is a group of n
consecutive words. In the simplest case, this value is
the raw term frequency, i.e. the number of occurrences
of the n-gram for the user of interest. However, this
frequency can be normalized in different ways (e.g.
logarithmic scale), and weighted by quantities such as the
inverse document frequency (which measures the rarity
of the term), resulting in a number of variants. We
present a few of them in more detail in our application
(Section 3.3).
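
As an illustration, the following Python sketch builds such a vector for one user. It is only a minimal example under our own assumptions: the names `ngrams` and `ngram_vector` are hypothetical, the tweets are assumed to be already tokenized, n-grams are not allowed to span two tweets, and the logarithmic normalization 1 + log(tf) is just one common variant among those mentioned above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(tweets, n=1, log_scale=False):
    """Map each n-gram used by a user to its (possibly log-scaled) frequency.

    tweets: list of token lists, one per tweet (assumed pre-tokenized).
    """
    counts = Counter()
    for tokens in tweets:
        counts.update(ngrams(tokens, n))
    if log_scale:
        # Assumed normalization: 1 + log(raw term frequency).
        return {gram: 1 + math.log(tf) for gram, tf in counts.items()}
    return dict(counts)


user_tweets = [["good", "car"], ["good", "car", "deal"]]
bigram_vector = ngram_vector(user_tweets, n=2)
unigram_vector = ngram_vector(user_tweets, n=1, log_scale=True)
```

The sparse dictionary form is used here because most n-grams never appear for a given user; a dense vector can be derived by fixing a vocabulary order.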

Many authors use unigram weighting (i.e. 1-grams,
or single words) to take advantage of the tweet content,
either by itself [18] or in combination with other
features [17,61,64,73,78]. Other authors also focus on
bigrams (2-grams, or pairs of words) [1,61,64,73], to
which the same weighting schemes can be applied as
for unigrams. It is also possible to define new ones,
for instance by taking advantage of the cooccurrence
graphs one can build from bigrams [18] (more details
on this in Section 3.3).

Instead of weights, it is alternatively possible to use
n-grams to identify the so-called prototypical
expressions associated with each considered class. One can then
characterize a user by looking for these expressions in
his tweets. Here, the word class is used in a broad sense,
and does not necessarily refer to a category of users:
certain authors use prototypical words to describe
sentiments [48,55,64,78], or locations [14]. Others prefer
to focus on topic distillation, i.e. simultaneously
identifying some topics and the words that characterize
them, then describing users in terms of their interest in
these topics depending on their use of the corresponding
words [2,17,77]. Moreover, the prototypical expressions
correspond to n-grams, so certain authors focus on
unigrams [1,15,50,55] while others use bigrams [1] or even
trigrams (3-grams, or triplets of words) [1,61].

2.1.6 Stylistic Traits 

The tweet content can also be described using
non-lexical features, which are gathered in this category.
Features 32 and 33 are the numbers of characters per
word and per tweet, respectively. The length of a tweet
is also sometimes expressed in words instead of
characters. These features can help characterize certain
types of users. For example, the content tweeted by
certain spambots is just a bag of keywords without
proper grammatical structure (e.g. [42]), which results
in a higher average word length.

On the same note, Feature 34 relies on a measure
quantifying the readability of the tweet. This can
correspond to the difficulty one would have in understanding
its meaning [78], or to the level of correctness of the
text (lexically and/or grammatically) [64]. For instance,
de Silva & Riloff use the latter to distinguish the tweets
of personal users from those of companies (which are
generally more correct) [64].

Feature 35 focuses more particularly on special
characters, i.e. non-alphanumerical ones, and/or specific
patterns such as emoticons and acronyms (LOL, LMFAO).
The use of special characters is typical of certain
spammers, who replace some characters with others of
the same shape (e.g. e for E) in order to convey the
same message without being detected by antispam
filters. Certain authors directly look for emoticons [61],
which are not used uniformly by all classes of users:
according to Rao et al., women tend to use them more.
Some emoticons can even be processed to identify the
sentiment expressed in the tweet [64]. Other patterns
of interest include characters repeated multiple times
(e.g. I am sooooo bored or what ?!!!!) [64,78], pronouns,
which are used by de Silva & Riloff to distinguish
individual persons from organizations [64], digits [8], and
spam-related words [8].

Features 36 and 37 are the numbers of hashtags
and URLs, respectively. Note that some authors focus
only on unique hashtags and URLs, i.e. they do not count
the same hashtag or URL twice. It is also possible to
compute the proportion of the user’s tweets which
contain at least one hashtag or URL [8,16], or an average
number of hashtags or URLs per tweet [55], or the
associated standard deviation [72]. Users regularly
tweeting URLs are likely to be information providers [38].
However, spammers also behave like this [8], so this
feature alone is not sufficient to distinguish them.
Spammers additionally tend to use shortened URLs to hide
their actual malicious destination, or the fact that the
same URL is repeated many times [8,75]. Certain authors
use blacklists of URLs in order to identify the tweets
containing malicious ones [16,31].

In extreme cases, certain users fill their tweets
with hashtags or URLs, much more than regular users
do. For instance, certain social capitalists publish
tweets containing only hashtags, all related to the
strategy they apply to gain visibility and exhort other
people to subscribe to their feed (e.g.
#TeamFollowBack, #Follow4Follow, cf. [23]).

Feature 38 consists in processing the self-similarity
of the user’s tweets, i.e. the similarity between each pair
of tweets he published, then using a statistic such as
the average to summarize the results [46,48].
Alternatively, one can also set a similarity threshold above
which two tweets are considered as similar, and
count the pairs of similar tweets (or use some derived
statistic) [75]. This feature was notably used in studies
aiming at detecting spammers: these users tend to post
the same tweets many times, or very similar ones
[45,46,75].
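
The two descriptors above can be sketched in a few lines of Python. This is only an illustrative example: the cited works do not necessarily use the same similarity measure, and we assume here a Jaccard similarity over whitespace-separated, lower-cased words. The function names and the 0.8 threshold are hypothetical.

```python
from itertools import combinations


def jaccard(tweet_a, tweet_b):
    """Jaccard similarity between the word sets of two tweets (assumed measure)."""
    words_a = set(tweet_a.lower().split())
    words_b = set(tweet_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def self_similarity(tweets, threshold=0.8):
    """Return (average pairwise similarity, number of pairs above threshold)."""
    sims = [jaccard(a, b) for a, b in combinations(tweets, 2)]
    if not sims:
        return 0.0, 0
    mean_sim = sum(sims) / len(sims)
    n_similar = sum(1 for s in sims if s >= threshold)
    return mean_sim, n_similar


tweets = ["buy cheap pills now", "Buy cheap pills now", "nice weather today"]
mean_sim, n_similar = self_similarity(tweets)
```

Note that the number of pairs grows quadratically with the number of tweets, which matters for very active accounts.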

2.1.7 External Data

This category contains features corresponding to data 
not retrieved directly from Twitter. Feature 39 is simply 
the number of results returned by some Web search 
engine, which point at the user’s Webpage. 

The next two features are scores computed by
private companies independent from Twitter, and aim at
measuring (in one way or another) the influence of
users. Of course, they differ in the definition of the
notion of influence they rely upon. Feature 40 is the Klout
score, which takes into account both Twitter-related and
external data gathered from other social networking
services and various search engines [60]. The precise list
of the features used to compute the Klout score was not
published, though. The algorithm behind the Kred
Influence Measurement [40] is open source (Feature 41).
It consists of two scores: Influence (how the user’s
tweets are received by others) and Outreach (how much
the user tends to spread others’ tweets).


2.2 General Remarks 

We conclude our review with three remarks concerning
all features. First, an important factor regarding the
selection of features is their availability. Depending on
the context of the considered study, not all the features
we listed can necessarily be used, for several reasons.
To begin with, the dataset given for the study might be
incomplete, relative to the features one wants to
process. For instance, if one has access to a collection of
tweets, one still has to retrieve the subscription
information to be able to use features from category Local
Connections. But the query limitations of the Twitter
API might prevent access to these data, or the
concerned accounts may no longer exist, or the users may
have changed their privacy settings. Some users also do
not fill in all the available fields, making it hard to use
certain features from category User Profile, unless the
tool used to analyze the data is able to handle missing
values.

There are also time-related constraints: the data
collected in practice only correspond to those that can
be obtained in a reasonable amount of time. Moreover,
even if one manages to retrieve all the necessary data,
the computation of certain features can be very
demanding if the dataset is too large, as we explained
for Feature 26. Certain authors focus on the evolution
of a given feature, as opposed to using a single value
to summarize it. For instance, in [46], Lee et al.
measure the change rate of the number of followees (Feature
17). This can significantly complicate the data retrieval
task, since it requires measuring the feature at
different moments.

In our list, we omitted features one cannot com¬ 
pute in a normal context. For instance, when treating 
influence, Ramirez-de-la-Rosa et al. use a feature corre¬ 
sponding to the type of job a user holds [62] . However, 
this feature comes from the RepLab dataset (see Section 
3.2) and was manually defined by a specialized agency. 
In practice, it is hardly possible to replicate exactly the 
same process on new data. 

Our second remark concerns the way features are
computed. We tried to stay general, and focus on what
each feature represents conceptually. However, in
practice, there are most of the time a number of ways
to process a feature, which differ in various aspects.
We indicated the main variants in the Descriptors
column of Table 1. However, we should emphasize that
this aspect is much more important for content-related
features, especially those from categories Lexical
Aspects and Stylistic Traits. Indeed, those features coming
from the NLP and IR fields are very sensitive to the
way the content is pre-processed. The most common
processes, such as removing punctuation marks (or
emoticons and other special symbols as in [62]) and hashtag
marks, lower-casing the text, or merging duplicated
characters (i.e. turning whaaaat? into what?), can result
in very different lexicons. Things get even more
complicated when it comes to removing stop-words, since in
practice each researcher uses his own list, often fitted
manually to a specific issue.

Finally, it is worth noting certain authors define 
more complex features by combining basic ones, such 
as the ones we listed in Table 1. For instance, in [69], 
Tommasel & Godoy define various ratios of the numbers 
of followers and followees (Feature 17), retweets (Fea¬ 
tures 21 and 22) and mentions (Features 25 and 26). 
In [46], Lee et al. use the ratio of the total length of 
the mentions present in the tweet, to the overall tweet 
length, both expressed in characters. This amounts to 
dividing Feature 25 by Feature 33. They also use the 
ratio of hashtag to tweet lengths, which is based on 
Features 36 and 33. Several other works use the same 
feature combination approach [4,8,16,61,72,75]. 

As mentioned before, the goal of this review was not 
to be exhaustive, which would be impossible given the 
number of works related to the characterization of Twit¬ 
ter users, but rather to present the most widespread and 
diverse features found in the literature. We focused on 
their meaning relatively to the user classification prob¬ 
lem, and organized them in a new typology. As an il¬ 
lustration, in the rest of this article, we select some of 
these features and apply them to an actual problem: 
the prediction of offline influence. 


3 Application to Offline Influence 

We illustrate the relevance of our feature review with an 
application to the prediction of offline influence based 
on Twitter data. In this section, we first define the no¬ 
tion of influence, and we discuss the difference between 
online and offline influence. We then describe RepLab 
2014, a CLEF challenge aiming at the identification of
Twitter users who are particularly influential in the
real world. Finally, we select a subset of the features
presented in Section 2, in order to tackle this problem.


3.1 Notion of Influence 

The Oxford Dictionary defines influence as "The ca¬ 
pacity to have an effect on the character, development, 
or behavior of someone or something". Various factors 
may be taken into account to measure the online in¬ 
fluence of Twitter users. Intuitively, the more a user is 
followed, mentioned and retweeted, the more he seems 
influential [13]. Nevertheless, there is no consensus re¬ 
garding which features are the most relevant, or even if 
other features would be more discriminant. Most of the 
existing academic works consider the way the user is 
interacting with others (e.g. number of followers, men¬ 
tions, etc.), the information available on his profile (age, 
user name, etc.) and the content he produces (number 
of tweets posted, textual nature of the tweets, etc). Sev¬ 
eral influence assessment tools were also proposed by 
companies such as Klout [60] and Kred [40] . 

Interestingly, these tools can be fooled by users
implementing certain simple strategies. Messias et al. [52]
showed that a bot can easily appear as influential to
Klout and Kred. Additionally, Danisch et al. [21]
observed that certain particular users called Social
Capitalists are also considered as influential although they
do not produce any relevant content. Indeed, the
strategy applied by social capitalists basically consists in
massively following and retweeting each other. On a
related note, Lee et al. [48] also showed that users they
call Crowdturfers use human-powered crowdsourcing
to obtain retweets and followers. Finally, several data
mining approaches were proposed regarding how to be
retweeted or mentioned in order to gain visibility and
influence [6,47,56,67].

A related question is how the user influence
measured on Twitter (or some other online networking
service) translates in terms of actual, real-world
influence. In other words: how does online influence match
offline influence? Some researchers proposed methods
to detect Influencers on the network. However, except
for some rare cases of very well known influential
people, validation is rarely possible. For this reason,
there is only a limited number of studies linking
real-life and network-based influence. Bond et al. [10]
explored this question for Facebook, with their
large-scale study about the influence of friends regarding
elections, and especially abstention. They showed in
particular that people who know that their Facebook friends
voted are more likely to vote themselves. More recently,
two conference tasks were proposed in order to
investigate real-life influencers based on Twitter: PAN [59]
and RepLab [3]. In this work, we focus on the latter,
which is described in detail in the next subsection.

3.2 RepLab Challenge

The RepLab Challenge 2014 dataset [3] was designed
for an influence ranking challenge organized in the
context of the Conference and Labs of the Evaluation
Forum (CLEF). Based on the online profiles and activity
of a collection of Twitter users, the goal of this chal¬ 
lenge was to rank these users in terms of offline (i.e. 
real-world) influence. This is exactly the task we want 
to perform here, which makes this dataset particularly 
relevant to us. We therefore use these data for our own 
experiments. In this subsection, we first describe the 
context of the challenge and the data. Then, we explain 
how the performance was evaluated, and we discuss the 
results obtained during the challenge, as a reference for 
later comparisons. Finally, we present a classification 
variant of the problem, which we will tackle in addition 
to the ranking task. 

3.2.1 Data and task 

The main goal of the RepLab challenge is to detect
offline influence using online Twitter data. The
RepLab dataset contains users manually labeled by
specialists from Llorente & Cuenca, a leading Spanish
e-Reputation firm. These users were annotated according
to their perceived real-world influence, and not by
considering specifically their Twitter account, although
annotators only considered users with at least 1,000
followers. The annotation is binary: a user is either an
Influencer or a Not-Influencer. The dataset contains a
training set of 2500 users, including 796 Influencers,
and a testing set of 5900 users, including 1563
Influencers. It also includes the IDs of the last 600 tweets
of each user, at the crawling and annotation time. This
represents a total of 4,379,621 tweets, i.e. around 750
megabytes of data. These tweets can be written either
in English or in Spanish. The dataset is publicly
available. Finally, RepLab provides a bounded and well
designed framework to efficiently evaluate features and
automatic influence detection systems.

Given the low number of real Influencers, the
RepLab organizers modeled the issue as a search problem
restricted to the Automotive and Banking domains. In
other words, the dataset was split in two, depending
on the main activity domain of the considered users.
The domains are mutually exclusive, i.e. one user
belongs to exactly one domain. The objective was to rank
the users in both domains in decreasing order of
influence. Both domains are balanced, with 2353 (testing,
including 764 Influencers) and 1186 (training) users for
the Automotive domain, and 2507 (testing, 712
Influencers) and 1314 (training) for the Banking domain.

CLEF: http://www.clef-initiative.eu/
Llorente & Cuenca: http://www.llorenteycuenca.com/
RepLab 2014 dataset: http://nlp.uned.es/replab2014/

The organizers proposed a baseline consisting in
ranking the users by descending number of followers.
Basically, this amounts to considering that the more
followers a given user has, the more he is expected to be
influential offline. This baseline is directly inspired by
online influence measurement tools.

3.2.2 Evaluation 

The RepLab framework [3] uses the traditional Mean 
Average Precision (MAP) to evaluate the estimated 
rankings. The MAP allows comparing an ordered vector 
(output of a submitted method) to a binary reference 
(manually annotated data). In the case of RepLab, it 
was computed independently from the language, and 
separately for each domain. 

For a given domain, the Mean Average Precision
MAP is computed as follows [12]:

$$\mathrm{MAP} = \frac{1}{n} \sum_{i=1}^{N} P(i)\, q(i) \qquad (1)$$

where N is the total number of users, n the number
of Influencers correctly found (i.e. true positives), P(i)
the precision at rank i (i.e. when considering the first i
users detected) and q(i) is 1 if the user is influential,
and 0 otherwise.

RepLab participants were compared according to 
the Average MAP processed over both Automotive and 
Banking domains. 
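
The evaluation can be sketched as follows in Python. This is a minimal illustration rather than the official RepLab scorer: the function names are hypothetical, the ranking is assumed to contain all the users of the domain (so that n is the number of Influencers in the list), and the toy follower counts are invented for the example.

```python
def average_precision(ranked_labels):
    """Equation 1 for one domain.

    ranked_labels[i] is 1 if the user placed at rank i+1 is an Influencer,
    and 0 otherwise.
    """
    n = sum(ranked_labels)          # Influencers present in the ranking
    if n == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, q in enumerate(ranked_labels, start=1):
        if q:
            hits += 1
            total += hits / rank    # P(rank) * q(rank), with q(rank) = 1
    return total / n


def followers_baseline(users):
    """Rank (n_followers, is_influencer) pairs by descending follower count."""
    ranked = sorted(users, key=lambda user: user[0], reverse=True)
    return [label for _, label in ranked]


# Toy data (invented): one list of (followers, label) pairs per domain.
automotive = [(10, 0), (500, 1), (50, 1), (20, 0)]
banking = [(900, 1), (800, 0), (700, 1), (600, 0)]
map_automotive = average_precision(followers_baseline(automotive))
map_banking = average_precision(followers_baseline(banking))
average_map = (map_automotive + map_banking) / 2
```

The last line mirrors the way participants were compared, i.e. by averaging the two domain-level scores.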

3.2.3 Results 

The UTDBRG group used Trending Topics
Information, assuming that Influencers tweet mainly about
so-called Hot Topics [2]. According to the official
evaluation, their proposal obtained the highest MAP for the
Automotive domain (0.721) and the best Average MAP
among all participants (0.565). UAMCLyR combined
user profile features and what they call writing behavior
(lexical richness, words and frequency of special
characters) using Markov Random Fields [74]. Still from an
NLP perspective, ORM_UNED [51] and LyS [73]
investigated POS tags as additional features to those
extracted from tweet contents. LyS also fed a classifier
with a bag-of-words built on the textual description
published on certain profiles. Their proposal obtained the
highest MAP for the Banking domain (0.524) and the
second-best Average MAP among all participants (0.563).

Based on the assumption that Influencers tend to
use specific terms in their tweets, the LIA group opted
to model each user based on the textual content
associated with his tweets [19]. Using k-Nearest Neighbors
(k-NN), they then matched each user to the most
similar ones in the training set. More recently, the same
team proposed some enhancements of this approach
[20]. They used a different tuning criterion and observed
ranking improvements relative to their official
challenge submission, which was outperformed with a MAP
of 0.764 and 0.652 for Automotive and Banking,
respectively, and an Average MAP of 0.708. Also using a
text-based method, our team (Cossu et al. [18]) obtained
even higher results, with a MAP reaching 0.803 for the
Automotive domain and 0.714 on average. The
performance for Banking remained lower, with a MAP of
0.626.

In the RepLab participants’ submissions, the
performance differences observed between domains are likely
due to the fact that one domain is more difficult to
process than the other. The Followers baseline remains
lower than most submitted systems, achieving a MAP of
0.370 for Automotive and 0.385 for Banking. All these
values are summarized in Table 3, in order to compare
them with our own results.

3.2.4 Classification Variant

Because the reference itself is only binary, the RepLab 
ordering task can alternatively be seen as a binary clas¬ 
sification problem, consisting in deciding if a user is an 
Influencer or not. However, this was not a part of the 
original challenge. Ramirez et al. [62] recently proposed 
a method to tackle this issue. We will also consider this 
variant of the problem in the present article. 

To evaluate the classifier performance, Ramirez et
al. used the F-Score averaged over both classes, based
on the Precision and Recall processed for each class,
which is typical in categorization tasks. This Macro
Averaged F-Score is calculated as follows:

$$F = \frac{1}{k} \sum_{c} \frac{2 P_c R_c}{P_c + R_c} \qquad (2)$$

where P_c and R_c are the Precision and Recall obtained
for class c, respectively, and k is the number of classes
(for us: 2). The performance is considered for each
domain (Banking and Automotive), as well as averaged
over both domains. It gives an overview of the system
ability to recover information from each class.
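
A macro-averaged F-Score of this kind can be computed as in the following Python sketch. It is given for illustration only: the function name is hypothetical, the labels are toy values, and we assume that the Precision, Recall or F-Score of a class is set to 0 whenever its denominator is zero, a convention the source does not specify.

```python
def macro_f_score(y_true, y_pred):
    """Average of the per-class F-Scores (harmonic mean of P_c and R_c)."""
    classes = sorted(set(y_true) | set(y_pred))
    f_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Assumed convention: undefined ratios count as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_scores.append(2 * precision * recall / (precision + recall))
        else:
            f_scores.append(0.0)
    return sum(f_scores) / len(f_scores)


# Toy example: 1 = Influencer, 0 = Not-Influencer.
reference = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
score = macro_f_score(reference, predicted)
```

In this toy case, the F-Score is 0.8 for class 0 and 2/3 for class 1, hence a macro average of 11/15.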

Ramirez et al. do not use any baseline to assess their
results. Nevertheless, the imbalance between the
influencers (31%) and non-influencers (69%) in the dataset
leads to a strong non-informative baseline which simply
consists in putting all users in the majority class
(non-influencers). This baseline, called MF-Baseline (most
frequent class baseline), achieves a 0.50 Macro Averaged
F-Score.

For this classification task, Ramirez et al. reached an
F-Score of 0.696 and 0.693 for the Automotive and
Banking domains, respectively, and a 0.694 Macro Averaged
F-Score. On the same task, our team (Cossu et al. [18])
proposed a classification method based on tweet
content, but obtained relatively low results (0.40 Macro
Averaged F-Score).

3.3 Experimental Setup 

In order to tackle the offline influence problem, we
adopted an exploratory approach: we do not know a
priori which features from Table 1 are relevant for the
considered problem. So, we selected as many of them
as possible. However, we could not take advantage of
all of them, or use all the descriptors available for a
given feature, be it for computational or time issues,
because the necessary data were not available, or
simply for practical reasons. In this subsection, we list the
selected features, which include both scalars and
vectors. We also describe how we processed them,
depending on their nature. The scripts corresponding to this
processing are publicly available online, as well as the
resulting outputs.

3.3.1 Scalar Features 

We selected scalar features from each category of Table 
1: User Profile (Features 1-5), Publishing Activity (Fea¬ 
tures 12, 14 and 16), Local Connections (Features 17- 
19), User Interaction (Features 21-25), Stylistic Traits 
(Features 33, 36 and 37), and External Data (Features 
39 and 40). For Lexical Aspects, as explained in Section 
3.3.3, we defined additional scalar features by averag¬ 
ing several vectors corresponding to Feature 30 (term 
cooccurrences or bigrams). 

Some of these features can be handled through
several descriptors, so we had to make some additional
choices. For Feature 16 (geolocation), we considered
both the number of distinct locations from which the
user tweeted, and the proportion of geolocated tweets
among his published tweets. Our intuition to consider
geolocation-related features was that some users might
tweet from some places of power or decision (relative
to their activity domain), which could be a good
indicator of real-world influence. Regarding Feature 17
(neighbors), we used the number of followers, the number
of followees, and the number of users who are both at
the same time (i.e. cardinality of the intersection of the
follower and followee sets). For Feature 19 (neighbors
ids), we considered the standard deviation of the ids of
the 5000 most recent followers, and did the same for the
followees. The topology of the follower-followee network
has proven to be an important feature for the
prediction of online influence, so it is worth a try when dealing
with offline influence. We investigated Feature 33 (tweet
length) considering average values expressed in terms
of both number of characters and number of words. We
discarded min and max values, because in our dataset
they tend to be the same (0 and 140) for all users. We
think tweet length is likely to be relevant to identify
authorities, which we suppose have more to say than
non-influential people. For Feature 25 (mentions), we used
the number of mentions per tweet, the number of unique
mentions per tweet, the proportion of tweets that contain
mentions, and the total number of distinct usernames
mentioned. Regarding Favorites (Features 23 and 24), we
hypothesized that tweets from influential users are
often marked as favorites by other users while influencers
do not use this functionality. For Feature 36 (hashtags),
we used the number of unique hashtags, the number of
hashtags per tweet, the number of unique hashtags per
tweet, and the proportion of tweets that contain
hashtags. We selected these features because previous
results such as [2] indicate that user activity on trending
topics is a great indicator of influence. Similarly, for
Feature 37 (URLs), we distinguished the number of
URLs per tweet, the number of unique URLs per tweet,
and the proportion of tweets that contain URLs. Note
that for the last 3 features, the uniqueness was
determined over all the user’s tweets (within the limits of the
RepLab dataset), and not tweet-by-tweet. Our
assumption here was that influential users tend to share links
towards websites related to their profession or activity
domain, and possibly aiming at specific types of media.
However, for technical reasons, it was not possible to
expand short URLs or to follow links, so we could not
completely put this idea to the test.

Scripts: https://github.com/CompNet/Influence
Outputs: http://dx.doi.org/10.6084/m9.figshare.1506785
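
To make the descriptors used for mentions, hashtags and URLs more concrete, here is a minimal Python sketch. It is not taken from the published scripts: the regular expressions, the function name `entity_descriptors` and the dictionary keys are our own assumptions. As stated above, uniqueness is determined over all the tweets of the user, not tweet-by-tweet.

```python
import re

# Simplified patterns (assumptions, not the patterns of the original scripts).
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def entity_descriptors(tweets, pattern):
    """Compute the four descriptors for one user and one kind of entity."""
    per_tweet = [pattern.findall(tweet) for tweet in tweets]
    all_items = [item for items in per_tweet for item in items]
    n_tweets = len(tweets)
    if n_tweets == 0:
        return {"unique_total": 0, "per_tweet": 0.0,
                "unique_per_tweet": 0.0, "proportion": 0.0}
    n_unique = len(set(all_items))  # uniqueness over all the user's tweets
    return {
        "unique_total": n_unique,
        "per_tweet": len(all_items) / n_tweets,
        "unique_per_tweet": n_unique / n_tweets,
        "proportion": sum(1 for items in per_tweet if items) / n_tweets,
    }


tweets = ["#cars are great #cars", "new #bank report http://t.co/x", "hello"]
hashtag_features = entity_descriptors(tweets, HASHTAG_RE)
url_features = entity_descriptors(tweets, URL_RE)
```

The same function can be reused with `MENTION_RE` to obtain the mention-related descriptors.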


We used non-linear classifiers in the form of
kernelized SVMs (RBF, Polynomial and Sigmoid kernels)
and logistic regression. We trained them using three
distinct approaches: first with each scalar feature alone,
second with all combinations of scalar features within
each category defined by us (as described in Table 1),
and third with all the scalar features at once. The two
domains from the dataset (Banking and Automotive)
were considered together and separately.


3.3.2 Term Occurrences

As mentioned in Section 2, Feature 30 focuses on the
lexical aspect of the tweet content. We now describe the
different methods we used to take advantage of this
feature. We focus on term occurrences, i.e. unigrams,
in this subsection, and on term cooccurrences, i.e.
bigrams, in the next. As a preprocessing step, the tweets
were first lower-cased. We then removed words composed
of only one or two letters, URLs, as well as punctuation
marks, but we kept mentions and hashtags as they were.
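
This preprocessing can be approximated by the following Python sketch. It is only indicative: the regular expressions and the name `preprocess` are assumptions of ours, and the published scripts may handle edge cases (e.g. apostrophes or emoticons) differently.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Everything that is neither a word character, a space, '@' nor '#'.
PUNCT_RE = re.compile(r"[^\w\s@#]")


def preprocess(tweet):
    """Lower-case a tweet, drop URLs, punctuation and 1-2 letter words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    tokens = []
    for token in text.split():
        is_tag = token[0] in "@#" and len(token) > 1  # mention or hashtag
        if is_tag or len(token) > 2:
            tokens.append(token)
    return tokens


tokens = preprocess("Check THIS out: http://t.co/abc #Cars by @Bob, so it is great!!")
```

Removing URLs before punctuation matters here, since stripping the punctuation first would break URLs into several spurious words.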

We defined our term-weighting using the classic
Term Frequency - Inverse Document Frequency (TF-IDF)
approach [65], combined with the Gini Purity
Criterion [71]. We first introduce these measures in a
generic way, before explaining how we applied them to
our data.

The Term Frequency TF_d(i) corresponds to the
number of occurrences of the term i in the document d.
The Inverse Document Frequency IDF(i) is defined as
follows:

$$IDF(i) = \log\left(\frac{N}{DF(i)}\right) \qquad (3)$$

where N is the number of documents in the training set,
and DF(i) is the Document Frequency, i.e. the number
of documents containing term i in the training set.

The purity G(i) of a word i is defined as follows:

$$G(i) = \sum_{c \in C} \left( \frac{DF_c(i)}{DF(i)} \right)^2 \qquad (4)$$

where C is the set of document classes and DF_c(i) is
the class-wise document frequency, i.e. the number of
documents belonging to class c and containing word i,
in the training set. G(i) indicates how much a term i is
spread over the different classes. It ranges from 1/|C|
when a given word i is well spread in all classes, to 1
when the word only appears in a single class.

These measures are combined to define two distinct
weights. First, the contribution $\omega_{i,d}$ of a term i given a
document d:

$$\omega_{i,d} = TF_d(i) \times IDF(i) \times G(i) \qquad (5)$$

and second, the contribution $\omega_{i,c}$ of a term i given a
document class c:

$$\omega_{i,c} = DF_c(i) \times IDF(i) \times G(i) \qquad (6)$$

Based on these weights, one can compute the
similarity between a test document d and a document class
c using the Cosine function as follows:

$$\cos(d, c) = \frac{\sum_{i \in d \cap c} \omega_{i,d} \times \omega_{i,c}}{\sqrt{\sum_{i \in d} \omega_{i,d}^2 \times \sum_{i \in c} \omega_{i,c}^2}} \qquad (7)$$

where i represents a term, d is the set of terms
contained in the considered document, and c is the set of
all terms contained in the documents forming the
considered class.
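
The whole weighting scheme can be sketched in Python as follows. This is an illustrative sketch rather than the published implementation: the function names are hypothetical, the natural logarithm is assumed for the IDF, the purity is taken as the sum over classes of the squared ratios DF_c(i)/DF(i), and terms of a test document that are unseen in the training set are simply ignored.

```python
import math
from collections import Counter


def train(docs, labels):
    """Compute IDF(i), G(i) and the class weights from tokenized documents."""
    n_docs = len(docs)
    df = Counter()                                  # DF(i)
    df_c = {c: Counter() for c in set(labels)}      # DF_c(i)
    for tokens, c in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_c[c][term] += 1
    idf = {t: math.log(n_docs / df[t]) for t in df}
    gini = {t: sum((df_c[c][t] / df[t]) ** 2 for c in df_c) for t in df}
    class_w = {c: {t: df_c[c][t] * idf[t] * gini[t] for t in df_c[c]}
               for c in df_c}
    return idf, gini, class_w


def doc_weights(tokens, idf, gini):
    """Weights of the terms of one document (terms unseen in training are skipped)."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] * gini[t] for t in tf if t in idf}


def cosine(w_d, w_c):
    """Cosine similarity between a document and a class."""
    num = sum(w * w_c[t] for t, w in w_d.items() if t in w_c)
    den = math.sqrt(sum(w * w for w in w_d.values())
                    * sum(w * w for w in w_c.values()))
    return num / den if den else 0.0


# Toy training set (invented): two documents per class.
docs = [["bank", "rate", "ceo"], ["bank", "ceo", "merger"],
        ["bank", "lol", "cat"], ["lol", "cat", "pizza"]]
labels = ["influential", "influential", "non_influential", "non_influential"]
idf, gini, class_w = train(docs, labels)
w_d = doc_weights(["ceo", "merger", "bank"], idf, gini)
sims = {c: cosine(w_d, class_w[c]) for c in class_w}
predicted = max(sims, key=sims.get)
```

With this toy data, "ceo" only appears in one class and gets a purity of 1, whereas "bank" is spread over both classes and gets 5/9, which reduces its contribution to the similarity.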

Now, let us see how we applied this generic approach
to our specific case. First, note that each domain
(Banking and Automotive) is treated separately, since a user
belongs to only one domain. Regarding the languages
(English and Spanish), we considered two approaches:
processing all tweets at once without any regard for
the language (called Joint in the rest of the article)
and treating the languages separately then combining
the corresponding classes or rankings (Separated). The
process itself has two steps.

Our first step consists in determining which tweets
to analyze for each user. We tested two different
strategies: 1) use all the tweets provided by RepLab (strategy
All); and 2) select only the most relevant tweets
(strategy Artex). The latter consists in extracting only the
10% most informative tweets the user published. For
this purpose, we used a statistical Tweet Selection
system developed in our research group, called Artex [70].
Briefly, it relies on a tf-idf-based vector
representation of, on one side, the user’s average tweet, and on
the other side, his vocabulary and sentences. The
selection is performed by keeping tweets maximizing the
cross-product between their vector, the vocabulary and
the average tweet.

Our second step consists in classifying the users
based on the Cosine similarity defined in Equation 7.
We tested two distinct approaches, which are
independent from the strategy used at the first step. In both
approaches, the terms i from Equation 7 are those
remaining after our preprocessing, and the set C
contains two document classes, which are the two
possible prediction outcomes: Influential vs. Non-Influential.
However, the nature of the documents d depends on the
approach.

The first approach is called User-as-Document
(UaD) [39]. It consists in merging all the tweets published
by a user into a single large document. In other
words, in this approach, a user is directly represented
by a document d. A class is also represented by a single
composite document, containing all the tweets written
by the concerned users. For instance, the document
representing the Influential class is the concatenation
of all tweets published by influential users. The classification
process is performed by assigning a user to the
most similar class, while the ranking depends on the
similarity to the Influential class. When the languages
are treated separately (Separated approach), we may
obtain several different classes and rankings for each
user, which need to be combined to get the final result.
For this purpose, we weight the language-specific
user-to-class similarities using the proportion of tweets
belonging to the considered language, and sum them. For
instance, if the user posted twice as many English as
Spanish tweets, the weight of the English similarity will
be twice that of the Spanish one.
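
The User-as-Document procedure with separated languages can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the experiments: Equation 7 is not reproduced in this part of the article, so the sketch uses a plain cosine similarity over raw term counts, and all names (term_vector, uad_scores) as well as the toy tweets are hypothetical.

```python
import math
from collections import Counter


def term_vector(tweets):
    """Merge a list of tweets into one document and count its terms."""
    return Counter(tok for tweet in tweets for tok in tweet.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def uad_scores(user_tweets_by_lang, class_docs_by_lang):
    """Language-weighted user-to-class similarities (Separated variant).

    user_tweets_by_lang: {language: [tweet, ...]}
    class_docs_by_lang:  {language: {class_name: term-count vector}}
    """
    total = sum(len(tweets) for tweets in user_tweets_by_lang.values())
    if total == 0:
        return {}
    scores = {}
    for lang, tweets in user_tweets_by_lang.items():
        weight = len(tweets) / total  # proportion of tweets in this language
        user_doc = term_vector(tweets)
        for cls, class_doc in class_docs_by_lang[lang].items():
            scores[cls] = scores.get(cls, 0.0) + weight * cosine(user_doc, class_doc)
    return scores


# Toy class documents (hypothetical data): one composite document per class
# and per language.
classes = {
    "en": {
        "Influential": term_vector(["new engine launch review", "engine test drive"]),
        "Non-Influential": term_vector(["lol traffic again", "stuck in traffic"]),
    },
    "es": {
        "Influential": term_vector(["prueba del motor nuevo"]),
        "Non-Influential": term_vector(["otra vez el trafico"]),
    },
}
user = {
    "en": ["engine review today", "launch of the new engine"],
    "es": ["motor nuevo"],
}

scores = uad_scores(user, classes)
predicted = max(scores, key=scores.get)  # classification: most similar class
ranking_score = scores["Influential"]    # ranking: similarity to Influential
```

Here the English similarities receive a weight of 2/3 and the Spanish ones 1/3, since the toy user posted two English tweets and one Spanish tweet.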

We call the second approach Bag-of-Tweets (BoT),
and it focuses on tweets instead of users. So this time,
the documents d from Equation 7 correspond to tweets,
and a user is represented by the set of tweets he published.
A document class is also represented through
such a Bag-of-Tweets (i.e. influential vs. non-influential
tweets). We compute the similarity between each user
BoT and each class BoT, then decide the classification
outcome using a voting process. We considered
two variants: the first one (called Count) consists in
keeping the majority class among the user’s tweets,
whereas the second one (called Sum) is based on the
sum of the user’s tweet similarities to the Influential class.
The ranking is obtained by ordering users depending
on the count or sum obtained for the Influential class.
When the languages are treated separately (Separated
approach), document classes are represented by several
distinct BoTs (one for each language). In order to combine
the possibly different classes or rankings obtained
for each language, we use the same approach as before:
we weight the votes using the proportion of tweets
belonging to the considered language.
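
The two Bag-of-Tweets voting variants can be illustrated with the short sketch below. Several points are assumptions made for the example and are not specified in the text: each tweet is compared to the aggregated term counts of a class, the similarity is a plain cosine over raw counts (Equation 7 is not reproduced here), the Sum variant decides by comparing the summed similarities of both classes, and all names and toy tweets are hypothetical.

```python
import math
from collections import Counter


def vec(text):
    """Term-count vector of a single text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bot_votes(user_tweets, class_vectors):
    """Return the Count and Sum tallies per class for one user."""
    counts = {cls: 0 for cls in class_vectors}
    sums = {cls: 0.0 for cls in class_vectors}
    for tweet in user_tweets:
        sims = {cls: cosine(vec(tweet), cv) for cls, cv in class_vectors.items()}
        # Count: one vote for the most similar class
        # (ties go to the first class listed).
        counts[max(sims, key=sims.get)] += 1
        # Sum: accumulate the similarity to every class.
        for cls, sim in sims.items():
            sums[cls] += sim
    return counts, sums


# Toy data (hypothetical).
class_vectors = {
    "Influential": vec("rate cut bond yield market bond"),
    "Non-Influential": vec("coffee queue bank closed again"),
}
user_tweets = ["bond yield rises", "market rate cut", "bank closed again"]

counts, sums = bot_votes(user_tweets, class_vectors)
label_count = max(counts, key=counts.get)  # Count decision
label_sum = max(sums, key=sums.get)        # Sum decision (assumed rule)
```

For ranking, users would then be ordered by counts["Influential"] or sums["Influential"], depending on the variant.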

3.3.3 Term Cooccurrences

We also processed Feature 30 based on bigrams. The
tweets were preprocessed in the following way: the text
was lowercased, and we removed words with one or two
letters, URLs, punctuation marks and stop-words (we
used simple stop-lists available on the Oracle Website).
Then, for each user, we computed a matrix representing
how many times each word pair (bigram) appears consecutively,
over all the tweets he posted. This consists in
representing each user by a document containing all his
tweets, like we did in the User-as-Document approach
from the previous subsection, except the focus is now on
cooccurrences instead of occurrences. The obtained matrix
is then considered as the adjacency matrix of the
so-called cooccurrence graph. Each node in this graph
represents a term, and the weight associated to a link
connecting two nodes is the number of times the corresponding
terms appear together in the text.
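
The preprocessing and bigram counting described above can be sketched as follows. The stop-list, the regular expressions and the function names are hypothetical, and the sketch assumes an undirected graph (the pair is stored without regard to word order), which the text does not state explicitly. The sparse dictionary of pair counts plays the role of the weighted adjacency matrix.

```python
import re
from collections import Counter

# Hypothetical toy stop-list; the actual stop-lists are not reproduced here.
STOP_WORDS = {"the", "and", "for", "with"}


def preprocess(tweet):
    """Lowercase, then drop URLs, punctuation, short words and stop-words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation marks
    return [w for w in text.split() if len(w) > 2 and w not in STOP_WORDS]


def cooccurrence_counts(tweets):
    """Count how many times each word pair appears consecutively."""
    pairs = Counter()
    for tweet in tweets:
        words = preprocess(tweet)
        for first, second in zip(words, words[1:]):
            pairs[tuple(sorted((first, second)))] += 1  # undirected link
    return pairs


tweets = ["The new engine is great! http://t.co/x", "New engine, new design"]
graph_weights = cooccurrence_counts(tweets)
```

Note that words are paired after filtering, so in the first toy tweet "engine" and "great" become consecutive once "is" has been removed.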

Two users can be compared directly by computing
the distance between their respective cooccurrence
matrices. For this purpose, we simply used the Euclidean
distance. We then applied the k Nearest Neighbors
method (k-NN) to separate Influential and Non-Influential
users by matching each user of the test collection
to the k closest profiles of the training set. We
tried different values of k, ranging from 1 to 20. During
the voting process, each neighbor vote is weighted using
his similarity to the user of interest. The ranking is obtained
by computing a score corresponding to the sum
of the influential neighbors’ similarities. Like before, the
domains were treated jointly and separately, and the
results obtained for different languages are combined
using the method previously described for the UaD approach
(Section 3.3.2).
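
The weighted k-NN step can be illustrated as follows. The text does not specify how the similarity is derived from the Euclidean distance, so the sketch assumes 1 / (1 + distance); this choice, the function names and the toy profiles are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sparse cooccurrence matrices."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


def knn_influence(test_user, training_set, k=3):
    """Weighted k-NN vote.

    training_set: list of (pair_counts, is_influential) tuples.
    Returns (predicted_label, ranking_score).
    """
    neighbors = sorted(
        ((euclidean(test_user, counts), label) for counts, label in training_set),
        key=lambda item: item[0],
    )[:k]
    votes = {True: 0.0, False: 0.0}
    for distance, label in neighbors:
        votes[label] += 1.0 / (1.0 + distance)  # assumed similarity
    # The ranking score is the summed similarity of influential neighbors.
    return votes[True] > votes[False], votes[True]


# Toy training profiles (hypothetical data).
training = [
    ({("a", "b"): 3}, True),
    ({("a", "b"): 2, ("b", "c"): 1}, True),
    ({("x", "y"): 4}, False),
    ({("x", "y"): 1}, False),
]
test_user = {("a", "b"): 3, ("b", "c"): 1}

label, score = knn_influence(test_user, training, k=3)
```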

It is also possible to summarize a cooccurrence
graph through the use of a nodal topological measure,
i.e. a function associating a numerical score to each
node in the graph, describing its position in the said
graph. Many such measures exist, taking various aspects
of the graph topology into account [26,44]. We
selected a set of classic nodal measures: Betweenness
[28], Closeness [7], Eigenvector [9] and Subgraph [25]
centralities, Eccentricity [35], Local Transitivity [76],
Embeddedness [43], Within-module Degree and Participation
Coefficient [34]. These measures are described in
detail in Appendix A. We selected them because they
are complementary: some are based on the local topology
(degree, transitivity), some are global (betweenness,
closeness, Eigenvector and subgraph centralities, eccentricity),
and the others rely on the network community
structure, and are therefore defined at an intermediary
level (embeddedness, within-module degree, participation
coefficient).

Each nodal measure leads to a vector of values, each
representing one specific term in the cooccurrence network.
For a given measure, a user is consequently represented
by such a vector. We process it using the same
SVMs as for the scalar features (Section 3.3.1). Note
that for the scalar features, each value of the SVM input
vector represents a distinct feature, whereas here
it corresponds to the centrality measured for one term.
Alternatively, we also computed the arithmetic means
of these vectors, for each nodal measure taken independently,
and used them as scalar features, as indicated
in Section 3.3.1.


4 Results and Discussions 

In this Section, we present the results we obtained on 
the RepLab dataset. We consider first the classification 
task, then the ranking one. Finally, we use a more visual 
approach to illustrate our discussion about the predic¬ 
tion of offline influence based on the features extracted 
from Twitter data. 


4.1 Classification 

The kernelized SVMs we applied did not converge when
considering scalar features, be it individually, by category,
by combining categories or all together. We obtained
the same behavior for the vector descriptors extracted
from Feature 30 (bigrams). This means the centrality
measures used to characterize the cooccurrence
network did not allow finding a non-linear separation
of our two classes. Those results were confirmed
by the logistic regressions: none of the trained classifiers
performed better than the most-frequent class
baseline (all users labeled as non-influential). We also applied
Random forests, which gave the same results. Yet,
as stated in Section 3.3, these classifiers usually
perform very well for this type of task.

However, we obtained some results for the remaining
descriptors of Feature 30, as displayed in Table 2. The
classification performances are shown in terms of F-Score
for each domain and averaged over domains, as
explained in Section 3.2. For comparison purposes, we
also reported in the same table the baseline, the results
obtained by Ramírez-de-la-Rosa et al. [62] using SVM,
and those of Cossu et al. [18], based on tweets content
(Section 3.2).

In Table 2, one can observe that, except for the results
provided by Ramírez-de-la-Rosa et al. [62], the
performance obtained for the Banking domain is always
lower than for the Automotive domain. This confirms
our observation from Section 3.2, regarding the higher
difficulty to detect a user’s influence for Banking than
for Automotive.

As mentioned before, the cooccurrence networks extracted
from Feature 30 were processed by the k-NN
method. The different k values we tested did not lead
to significantly different results. The best one is displayed
in Table 2 and is clearly below the baseline for
both domains. The features absent from the table were
not able to reach the baseline level, let alone state-of-the-art
scores.

The NLP cosine-based approaches applied to Feature 30
showed competitive performances, noticeably
higher than the baselines. Without language-specific
processing (Joint method), the Bag-of-Tweets approach
obtained state-of-the-art results, while the User-as-Document
one outperformed all existing methods reported
for this task, to the best of our knowledge. For both
approaches, the performances are clearly improved
when processing the languages separately (Separated
method). This might be due to the fact that certain words
are used in both languages, but in different ways.

Regarding the decision strategy used for BoT, summing
(Sum) the votes improves the performance compared
to simply counting them (Count). This effect is
more or less marked depending on the way the languages
are treated: no effect for Joint, strong effect for
Separated. The domain also affects this improvement,
which is much smaller for Banking than for Automotive.
This could indicate users behave differently, in
terms of how they write tweets, depending on their
domain. This would be consistent with our assumption
regarding the use of different terminology by influential
users of distinct activity domains.

Table 2 Classification performances ordered by Average F-Score.

Feature and descriptor                            Automotive  Banking  Average
Feature 30 User-as-Document Separated All            .833       .751     .792
Feature 30 User-as-Document Separated Artex          .829       .745     .787
Cossu et al. [18]                                    .812       .751     .781
Feature 30 Bag-of-Tweets Separated Artex Sum         .820       .721     .770
Feature 30 Bag-of-Tweets Separated All Sum           .817       .709     .763
Feature 30 Bag-of-Tweets Separated Artex Count       .796       .719     .757
Feature 30 Bag-of-Tweets Separated All Count         .786       .702     .744
Feature 30 User-as-Document Joint All                .782       .682     .732
Feature 30 User-as-Document Joint Artex              .773       .672     .722
Ramírez-de-la-Rosa et al. [62]                       .696       .693     .694
Feature 30 Bag-of-Tweets Joint All Count             .725       .641     .683
Feature 30 Bag-of-Tweets Joint All Sum               .725       .641     .683
MF-Baseline                                          .500       .500     .500
Feature 30 Cooccurrence networks                     .403       .417     .410

The tweet selection step (approach Artex) affects
the BoT and UaD methods differently. For the former,
there is an increase in performance, compared to using
all available tweets (approach All). Moreover, this
increase is noticeably higher for Banking than for Automotive,
which supports our previous observation regarding
differences in writing between domains. The
latter method (UaD), on the contrary, is negatively affected
by Artex. This can be explained in the following
way: the tweet selection is a filter step, which reduces
the noise contained in the user’s Bag-of-Tweets, thus
causing an increase in performance. However, the User-as-Document
method already performs a relatively similar
simplification, lexically speaking, so the improvement
is much smaller, or can even turn into a deterioration.

The positive aspects of our results must be tempered
by the fact that the differences observed between the
best unigram variants proposed for Feature 30, as well
as Cossu et al.’s method, are not statistically significant
(according to a standard t-Test). More precisely,
this observation concerns all rows from Table 2 between
the first one and Ramírez-de-la-Rosa et al.’s. The difference
with Ramírez-de-la-Rosa et al.’s method could
not be tested directly, because we could not get access
to their classification output. Our results nevertheless
demonstrate that detecting offline influence is more efficiently
tackled by taking content into account, rather
than considering a large variety of text-independent features.
In other words, for this task, writing similarities
seem to be more relevant than any other Twitter-based
information such as profile information, posting behavior
or subscription-based interconnections.


4.2 Ranking 

The results obtained for the ranking task are displayed 
in Table 3 in terms of MAP, for each domain and aver¬ 
aged over domains. Again, one can observe that except 
for very few features, all scores are lower for the Bank¬ 
ing domain than for the Automotive one. 

The UTDBRG row corresponds to the scores ob¬ 
tained at RepLab by the UTDBRG group [2], which 
reached the highest average performance and the best 
MAP for Automotive. This high performance for the 
Automotive domain, using an approach based on trend¬ 
ing topics, probably reflects a tendency for Influencers 
to be up-to-date with the latest news relative to brand 
products and innovation in their domain. This state¬ 
ment is not valid for Banking, where we can suppose 
that influence is based on more specialized and tech¬ 
nical discussions. This is potentially why our previous 
approach (Cossu et al.) based on tweets content ob¬ 
tained a good result for this domain, as mentioned in 
Section 3.2. 

As mentioned in Section 3.3.1, we first evaluated
the logistic regression trained with each scalar feature
alone, with each one of their categories, with each combination
of categories, and with all scalar features at
once. The best results are presented on the row Best
Regression, and were obtained by combining the selected
features of the following categories (cf. Table 1):
User activity, Profile fields, Stylistic aspects and External
data. The scores for this combination of features are
just above the RepLab baseline, and far from the state-of-the-art
approaches.

Table 3 Ranking performances ordered by Average MAP (the best
ones are represented in bold).

Feature and descriptor                            Automotive  Banking  Average
Feature 30 User-as-Document Separated All            .803       .626     .714
Cossu et al. [18]                                    .764       .652     .708
Feature 30 Bag-of-Tweets Separated All Sum           .779       .628     .703
Feature 30 Bag-of-Tweets Separated Artex Sum         .774       .633     .703
Feature 30 User-as-Document Separated Artex          .782       .623     .702
Feature 30 Bag-of-Tweets Separated Artex Count       .778       .612     .695
Feature 30 Bag-of-Tweets Separated All Count         .762       .592     .677
Feature 30 User-as-Document Joint All                .735       .538     .636
Feature 30 User-as-Document Joint Artex              .722       .547     .634
Feature 30 Bag-of-Tweets Joint All Sum               .699       .526     .612
Feature 30 Bag-of-Tweets Joint All Count             .626       .504     .565
UTDBRG - Aleahmad et al. [2]                         .721       .410     .565
Feature 12 Total Number of tweets                    .332       .449     .385
Best Regression                                      .424       .338     .381
RepLab Baseline                                      .370       .385     .378
Feature 30 Cooccurrence networks                     .298       .300     .299
Feature 40 Klout score                               .304       .275     .289

For each numerical scalar feature, we also considered
the feature values directly as a ranking method. The
best results were obtained using the number of tweets
posted by each user (Feature 12). Although its average
MAP is just above the baseline, the performance
obtained for the Banking domain is above UTDBRG,
the previous state-of-the-art results. Thus, we may consider
this feature as the new baseline of this specific
domain. All other similarly processed features remain
lower than the official baseline. The results obtained for
Feature 40 reflect very poor rankings. This is very surprising,
because this feature is the Klout Score, which
was precisely designed to measure influence in general
(i.e. both on- and offline).

The rest of the results presented in Table 3 are the
best we obtained for Feature 30. Those obtained using
the direct comparison of cooccurrence networks are
slightly better than for the Klout Score. The cosine-based
methods applied to Feature 30 led to very interesting
results. The Bag-of-Tweets method obtained an
average state-of-the-art performance, while the User-as-Document
method reaches very high average MAP values,
even larger than the state-of-the-art, be it domain-wise
(for Automotive and Banking) or on average.

Compared to the classification results, the performances
of the BoT and UaD methods are tighter, but
the latter still dominates the former. Again,
both methods get better results when the languages
are treated separately (approach Separated). The BoT
method still appears to perform better when using the
Sum decision strategy (instead of Count). Including the
tweet selection step (Artex) showed no significant performance
changes, be it in terms of increase or decrease.
This means describing a user based on the vocabulary
he uses over all his tweets retains the information necessary
to rank his influence level.

Our results indicate that influential users from a 
specific domain behave differently and write in a par¬ 
ticular manner compared to other users. In other words, 
Influencers are characterized by a certain editorial be¬ 
havior. For bilingual users, as observed for the classifi¬ 
cation task, separating their tweets in order to process 
the languages separately led to improvements in the 
ranking performance. This suggests that words origi¬ 
nating from one language get a different meaning when 
used in the context of the other language. 

Ramírez-de-la-Rosa et al. [62] were able to take advantage
of certain scalar features to feed SVM-based
classifiers in order to tackle the classification task,
while RepLab participants such as the LyS [73] and
UNED_ORM [51] groups did the same for the ranking
task. However, we were not able to obtain any results
when using the same classification tools and similar features
(no convergence). The large variety of descriptors
that can be considered for each feature may explain this
difference: a wrong descriptor choice is quite sufficient
to mislead the training process. Yet, it is sometimes difficult
or even impossible to find all the required details
in the literature or on the Web. This is the reason why
we put our source code and outputs online, in order
to ease the replication of the process which led to the
results presented in this article.

Despite this performance reproduction issue, our
NLP-based methods reached higher scores than state-of-the-art
works, for both classification and ranking.
This indicates that typical SNA features classically
used to detect spammers, social capitalists or influential
Twitter users, are not very relevant to detect offline
Influencers. In other terms, these typical features
might be efficient to characterize influence perceived
on Twitter, but not outside of it. Compared to other
previous content-based methods, our approach consisting
in representing a user under various forms of tweet
bags-of-words also gave very good results. In particular,
our User-as-Document method was far better than the
best state-of-the-art approaches for both classification
and ranking tasks. We suppose the way a user writes
his tweets is related to his offline influence, at least
for the studied domains. However, our attempt to extend
this occurrence-based approach to a cooccurrence-based
one using graph measures did not lead to good
performances.


4.3 PLS path modeling 

In this last subsection, we come back to the scalar fea¬ 
tures and deepen the study of their relationship with of¬ 
fline influence through the use of Partial Least Squares 
Path Modeling (PLS-PM) [79]. 

The PLS algorithm handles all kinds of scales and is
known to be well suited to combine nominal and binary
variables. PLS-PM makes it possible to represent a set of
variables as a structure made of blocks of manifest (observed)
variables. Each block is summarized by a latent variable,
which depends on all the manifest variables constituting
the block. PLS-PM estimates the best weights
(between the manifest and latent variables, and between
the latent variable and the predicted variable),
by calculating the solution of the general underlying
model of multivariate PLS [36]. The R index is used
to estimate the model quality (it maximizes the square
sum of correlations inside latent variables and between
related variables). PLS-PM is a confirmatory approach
which needs an initial conceptual model derived from expert
knowledge, and also makes it possible to extract information
from the data. Furthermore, it offers a graphical representation
of the relations between manifest and latent
variables, which is valuable for analysis, even by non-specialists.
For an extensive review and more details on
PLS path modeling, see [68].

Our application case (influence detection) can be
viewed as a customer satisfaction index analysis as defined
by Fornell [27]. We propose a conceptual model
combining the feature categories we defined in
Section 2 (cf. Table 1). Our objective is to explain why
classifiers exploiting these features failed, and to discover
robust relations between latent variables. We also
intend to investigate links between the features we selected
and the values proposed by the best classifier applied
to Feature 30, since it performed very well. Our
model has 4 hierarchical levels: first the features (manifest
variables), each one connected to its category (latent
variable), constituting the second level. Each category
is in turn connected to either a Classifier variable
(representing the classifier output) or a Reference
variable (representing the ground truth from RepLab).
We connected the content-based categories to Classifier
(which is itself content-based), whereas the rest are connected
to Reference. The Classifier variable is itself the
third level, since it is also connected to the Reference
variable (the classifier output is supposed to be related
to the actual influence). In other words, there are two
types of categories in our model: those that directly induce
Reference, and those related to the classifier, which
in turn induces the Reference.

The following experiments were made considering
all users from the test set for which we could collect all
feature values, i.e. 2310 and 2410 users for Automotive
and Banking, respectively. We selected as many features
as possible and considered the method that obtained
the best ranking result, that is to say: the UaD method
applied to All tweets with Separated languages. As an
example, Figure 1 shows the latent variables representing
the Publishing Activity and User Profile categories,
and their related manifest variables. The other categories
are not displayed for space reasons. The weights
displayed in the figures correspond to the version of the
correlation processed by PLS-PM. Note that a negative
sign does not necessarily correspond to a negative
correlation: PLS-PM selects the signs in order to maximize
the summed correlation values over the considered
subgroup of variables.


Fig. 1 Internal correlation of two latent variables: Publishing
Activity (left column) and User Profile (right column) categories,
for the Automotive (top row) and Banking (bottom
row) domain models.

Figure 1 shows that the feature correlations differ depending
on the domain. For the Automotive domain,
Features 15 and 16 have close to zero correlation values
within their category, while Features 14 and 12 reach
much higher absolute values. Feature 12, in particular,
is very close to -1, which is consistent with the observation
we made in Section 4.2 regarding its use as
a good baseline. For the Banking domain, it is quite
the opposite: the Geolocation aspects are highly correlated,
whereas the other features have a close to zero
correlation. For User Profile, the behavior is the same
for both domains, with a strong correlation of Feature
5 (description length), and lesser correlation values for
Features 2 (verified account) and 1 (image presence).
It also indicates that influential users tend to have a
complete account, which allows people, and mainly their
followers, to be sure about who they are.

We now briefly describe our results for the other
categories (not represented here). For the Automotive
domain, the hashtag-related features are the main component
of the Stylistic Traits category. It confirms the
intuition from Aleahmad et al. [2] about the Influencers’
ability to be on the lookout for trending topics
for this domain. For the Banking domain, the numbers
of URLs and Unique URLs obtained the highest scores
in this category. According to this observation, future
works should look toward computing an informativity
index over both the tweets and the URLs they contain,
in order to improve influence detection. Additional textual
information from the targeted Web pages could
also feed the NLP-based machine learning approaches
to select the most relevant pages or parts of pages. Concerning
the Lexical Aspects category, Feature 27 (lexicon
size) appears to be important for both domains,
whereas Feature 28 (hapaxes, i.e. words specific to the
user) reaches a high correlation for the Automotive domain
only.

Figure 2 depicts the second part of the regression
model, i.e. the relationships between the latent variables
and the Classifier and Reference variables, as well
as the relationship between Classifier and Reference.
The Classifier variable is clearly correlated to the Reference
for both domains, although the values are closer to
0.5 than 1, which confirms the classification and ranking
results obtained for Feature 30. Certain categories have
close to zero correlation for both domains: User Profile,
User Interaction and External Data (which, in our
case, contains only the Klout score), although the internal
correlations within these categories are high. This
means the categories are homogeneous, but not relevant
for influence prediction. Some categories reach a larger
than 0.1 correlation (in absolute value): Publishing Activity
for Automotive, Local Connections for Banking,
and Lexical Aspects for both. The differences observed
between the domains confirm our assumption that the
notion of offline influence takes a different form in Automotive
and Banking. The Stylistic Traits category has
a much higher correlation than the other ones, for both
domains, which highlights the interest of content-based
features. Overall, the correlation between the categories
and the Classifier and Reference variables is very low.
This means the model is unable to find strong links
with the influence estimation according to these latent
variables, and can be related to the fact the SVMs did
not converge when applied to these features.

Fig. 2 Correlation between the feature categories for the Automotive
(top) and Banking (bottom) domains.


5 Conclusion 

In this article, we have focused on the problem of user
characterization on Twitter, and more particularly on
the features used in the literature to perform such a
classification. We first investigated a wide range of features
coming from different research domains (mainly
Social Network Analysis, Natural Language Processing
and Information Retrieval), before proposing a new
typology of features.

We then tackled the problem of identifying and
ranking real-life Influencers (a.k.a. offline influencers)
based on Twitter-related data, as specified by the RepLab
2014 challenge. For this experimental part, we can
highlight two main results. First, we showed that classical
SNA features used to detect spammers, social capitalists
or users influential on Twitter, do not give any
relevant result on this problem. Our second result is the
proposal of an NLP approach consisting in representing
a user under various forms of bags-of-words, which led
to a much better performance than all state-of-the-art
methods (both content-based and -independent). From
our result, we can suppose the way a user writes his
tweets is related to his real-life influence, at least for
the studied domains. This would confirm assumptions
previously expressed in the literature regarding the fact
users from specific domains behave and write in their
own specific way.

It is important to highlight the fact our experimen¬ 
tal results are valid only for the considered dataset. This 
means they are restricted to the domains it describes 
(Automotive and Banking), and are only as good as the 
manual annotation of the data. In RepLab 2014 [3], the 
organizers were not able to conclude on significant dif¬ 
ferences between participants (and features or methods 
used) due to the small number of considered domains. 
Furthermore, the delay between our experiments and 
the annotation of the data may cause some bias, since 
certain users stopped their activities while others be¬ 
came more involved and earned followers. 

We think our results could be improved thanks to
content-independent features. In particular, we hypothesize
a more advanced use of the geolocation feature
could help identify geographical areas from which
Influencers tweet, e.g. financial places for the Banking
domain. Our approach based on cooccurrence graphs
did not result in good performances, but could be improved
in two ways. First, it is possible to use other
graph measures, at different levels (micro, meso and
macro) [26]. Second, we could relax the notion of cooccurrence,
by considering word neighborhoods of higher
order.


A Centrality measures

In their description, we note $G = (V, E)$ the considered cooccurrence
graph, where $V$ and $E$ are its sets of nodes and links,
respectively.

The Degree measure $d(u)$ is quite straightforward: it is
the number of links attached to a node $u$. So in our case, it
can be interpreted as the number of words co-occurring with
the word of interest. More formally, we note
$N(u) = \{v \in V : \{u, v\} \in E\}$ the neighborhood of node $u$, i.e. the set of
nodes connected to $u$ in $G$. The degree $d(u) = |N(u)|$ of a
node $u$ is the cardinality of its neighborhood, i.e. its number
of neighbors.

The Betweenness centrality $C_b(u)$ measures how much a
node $u$ lies on the shortest paths connecting other nodes. It
is a measure of accessibility [28]:

$$C_b(u) = \sum_{v<w} \frac{\sigma_{vw}(u)}{\sigma_{vw}} \tag{8}$$

Where $\sigma_{vw}$ is the total number of shortest paths from node $v$
to node $w$, and $\sigma_{vw}(u)$ is the number of shortest paths from
$v$ to $w$ running through node $u$.

The Closeness centrality $C_c(u)$ quantifies how near a
node $u$ is to the rest of the network [7]:

$$C_c(u) = \frac{1}{\sum_{v \in V} dist(u, v)} \tag{9}$$

Where $dist(u, v)$ is the geodesic distance between nodes $u$ and
$v$, i.e. the length of the shortest path between these nodes.

The Eigenvector centrality $C_e(u)$ measures the influence
of a node $u$ in the network based on the spectrum of its adjacency
matrix. The Eigenvector centrality of each node is
proportional to the sum of the centrality of its neighbors [9]:

$$C_e(u) = \frac{1}{\lambda} \sum_{v \in N(u)} C_e(v) \tag{10}$$

Here, $\lambda$ is the largest Eigenvalue of the graph adjacency matrix.

The Subgraph centrality $C_s(u)$ is based on the number
of closed walks containing a node $u$ [25]. Closed walks are
used here as proxies to represent subgraphs (both cyclic and
acyclic) of a certain size. When computing the centrality, each
walk is given a weight which gets exponentially smaller as a
function of its length.

$$C_s(u) = \sum_{\ell=0}^{\infty} \frac{(A^{\ell})_{uu}}{\ell!} \tag{11}$$

Where $A$ is the adjacency matrix of $G$, and therefore $(A^{\ell})_{uu}$
corresponds to the number of closed walks of length $\ell$ containing $u$.

The Eccentricity $E(u)$ of a node $u$ is its furthest
(geodesic) distance to any other node in the network [35]:

$$E(u) = \max_{v \in V} dist(u, v) \tag{12}$$
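
Both distance-based measures (Equations 9 and 12) can be obtained from a breadth-first search, as in the following sketch. It assumes an unweighted, connected graph stored as a dictionary of neighbor sets, with distances counted in number of links; the function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Geodesic distances (in links) from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, u):
    """Inverse of the summed distances from u to the other nodes."""
    total = sum(bfs_distances(graph, u).values())
    return 1.0 / total if total else 0.0


def eccentricity(graph, u):
    """Largest geodesic distance from u to any other node."""
    return max(bfs_distances(graph, u).values())


# Toy path graph a - b - c - d (hypothetical data).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
closeness_b = closeness(path, "b")        # distances 1, 1, 2 -> 1 / 4
eccentricity_a = eccentricity(path, "a")  # furthest node is d, at distance 3
```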

The Local Transitivity $T(u)$ of a node $u$ is obtained by
dividing the number of links existing among its neighbors, by
the maximal number of links that could exist if all of them
were connected [76]:

$$T(u) = \frac{|\{\{v, w\} \in E : v \in N(u) \wedge w \in N(u)\}|}{d(u)(d(u) - 1)/2} \tag{13}$$

Where the denominator corresponds to the binomial coefficient
$\binom{d(u)}{2}$. This measure ranges from 0 (no connected neighbors)
to 1 (all neighbors are connected).
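
Equation 13 translates almost directly into code. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets, and returns 0 for nodes with fewer than two neighbors, a convention which is not discussed in the text; the function name and toy graph are hypothetical.

```python
from itertools import combinations


def local_transitivity(graph, u):
    """Proportion of pairs of neighbors of u that are themselves linked."""
    neighbors = graph[u]
    d = len(neighbors)
    if d < 2:
        return 0.0  # assumed convention: no pair of neighbors exists
    links = sum(1 for v, w in combinations(neighbors, 2) if w in graph[v])
    return links / (d * (d - 1) / 2)


# Toy graph (hypothetical data): triangle a-b-c plus a pendant node d.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
t_a = local_transitivity(toy, "a")  # only b-c is linked among 3 pairs
t_b = local_transitivity(toy, "b")  # its two neighbors a and c are linked
```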

The Embeddedness $e(u)$ represents the proportion of
neighbors of a node $u$ belonging to its own community [43].
The community structure of a network corresponds to a
partition of its node set, defined in such a way that a maximum of
links are located inside the parts while a minimum of them lie
between the parts. We note $c(u)$ the community of node $u$, i.e.
the part that contains it. Based on this, we can define the
internal neighborhood of a node $u$ as the subset of its
neighborhood located in its own community: $N^{int}(u) = N(u) \cap c(u)$.

Then, the internal degree $d^{int}(u) = |N^{int}(u)|$ is defined as
the cardinality of the internal neighborhood, i.e. the number
of neighbors the node $u$ has in its own community. Finally,
the embeddedness is the following ratio:

$$e(u) = \frac{d^{int}(u)}{d(u)} \qquad (14)$$


It ranges from 0 (no neighbors in the node community) to 1 
(all neighbors in the node community). 
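
The embeddedness ratio can be sketched as follows, assuming the community structure is already available as a mapping from nodes to community labels. The graph, the labels and the function name are made up for the example, and isolated nodes are given a value of 0 by assumption.

```python
def embeddedness(adj, community, u):
    """Eq. (14): internal degree of u divided by its total degree."""
    if not adj[u]:
        return 0.0  # assumed convention for isolated nodes
    internal = [v for v in adj[u] if community[v] == community[u]]
    return len(internal) / len(adj[u])

# Hypothetical toy graph: two triangles {1,2,3} and {4,5,6} joined by the link 3-4
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
community = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(embeddedness(adj, community, 1), embeddedness(adj, community, 3))
```

Node 1 has all its neighbors in its own community (embeddedness 1), while node 3 has one of its three neighbors in the other community (embeddedness 2/3).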

The last two measures were proposed by Guimerà & Amaral
[34] to characterize the community role of nodes. For a
node $u$, the Within Module Degree $z(u)$ is defined as the
z-score of the internal degree, processed relatively to its
community $c(u)$:

$$z(u) = \frac{d^{int}(u) - \mu(d^{int}, c(u))}{\sigma(d^{int}, c(u))} \qquad (15)$$

Where $\mu$ and $\sigma$ denote the mean and standard deviation of
$d^{int}$ over all nodes belonging to the community of $u$,
respectively. This measure expresses how much a node is connected
to other nodes in its community, relatively to this community.
By comparison, the embeddedness is not normalized with
respect to the community, but to the node degree.
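
The z-score of Eq. (15) can be sketched as below. The use of the population standard deviation and the value 0 returned when all members of a community have the same internal degree are assumptions of this example, as are the names and the toy graph.

```python
import statistics

def within_module_degree(adj, community, u):
    """Eq. (15): z-score of the internal degree of u within its own community."""
    members = [v for v in adj if community[v] == community[u]]
    d_int = {v: sum(1 for w in adj[v] if community[w] == community[v])
             for v in members}
    values = list(d_int.values())
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return (d_int[u] - mu) / sigma if sigma > 0 else 0.0

# Hypothetical toy graph: community A is a star centered on node 1,
# community B is the pair {5, 6}, and the link 4-5 joins both communities
adj = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 5}, 5: {4, 6}, 6: {5}}
community = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}
print(within_module_degree(adj, community, 1))
```

In community A the internal degrees are 3, 1, 1 and 1, so the hub (node 1) gets a clearly positive score while the other three nodes get the same negative score.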

The Participation Coefficient is based on the notion of
community degree, which is a generalization of the internal
degree: $d_i(u) = |N(u) \cap C_i|$. This degree $d_i$ corresponds to
the number of links a node $u$ has with nodes belonging to
community number $i$. The participation coefficient is defined
as:

$$P(u) = 1 - \sum_{i=1}^{k} \left( \frac{d_i(u)}{d(u)} \right)^2 \qquad (16)$$


Where $k$ is the number of communities, i.e. the number of
parts in the partition. $P$ characterizes the distribution of the
neighbors of a node over the community structure. More
precisely, it measures the heterogeneity of this distribution: it
gets close to 1 if all the neighbors are uniformly distributed
among all the communities, and to 0 if they are all gathered
in the same community.
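
The participation coefficient can be sketched by counting the neighbors of a node per community. As before, the community labels are assumed to be given, the names and the toy graph are hypothetical, and isolated nodes are assigned 0 by convention.

```python
from collections import Counter

def participation_coefficient(adj, community, u):
    """One minus the sum of the squared shares of u's neighbors in each community."""
    d = len(adj[u])
    if d == 0:
        return 0.0  # assumed convention for isolated nodes
    per_community = Counter(community[v] for v in adj[u])  # community degrees d_i(u)
    return 1.0 - sum((d_i / d) ** 2 for d_i in per_community.values())

# Hypothetical toy graph: hub h linked to two nodes of community A and two of B
adj = {"h": {"a1", "a2", "b1", "b2"},
       "a1": {"h"}, "a2": {"h"}, "b1": {"h"}, "b2": {"h"}}
community = {"h": "A", "a1": "A", "a2": "A", "b1": "B", "b2": "B"}
print(participation_coefficient(adj, community, "h"))
```

The hub splits its neighbors evenly over two communities, which gives 1 - (0.25 + 0.25) = 0.5, whereas node a1 has its single neighbor in one community and therefore gets 0.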

Both community role measures are defined independently 
from the method used for community detection (provided it 
identifies mutually exclusive communities). In this work, we 
applied the InfoMap method [63], which was deemed very 
efficient in previous studies [54].






